Preparing dataset: Inserting rows

Roman Mostazir

Join Date: Apr 2014
Posts: 874

Preparing dataset: Inserting rows

03 Mar 2015, 14:22

Dear all,

I have the following data set in long format where individuals are repeatedly measured both in baseline and in followup as shown in the variable 'pre'. The variable 'dayno' refers to the first day at both occasion. 'dcount' refers the number of measurements on each day. As can be seen that the number of measurements are not same on first day for both occasions.

Problem:

- My question is how can I insert two rows that are short in baseline compared to the followup.

Other information:

- 'Dayno' varies from 1-6

- In other 'id' it could be a reversal situation, i.e 'dcount' higher in baseline and lower in follow up.

-Naturally the new rows will create missing values which is not of concern.

Code:

   +--------------------------------------------+
     | id         pre   dayno   dcount    outcome |
     |--------------------------------------------|
  1. |  1    Baseline       1        4   3.571429 |
  2. |  1    Baseline       1        4   4.571429 |
  3. |  1    Baseline       1        4   2.857143 |
  4. |  1    Baseline       1        4   3.857143 |
     |--------------------------------------------|
  5. |  1   Follow up       1        6          6 |
  6. |  1   Follow up       1        6          6 |
  7. |  1   Follow up       1        6   5.714286 |
  8. |  1   Follow up       1        6   5.714286 |
  9. |  1   Follow up       1        6   5.857143 |
 10. |  1   Follow up       1        6          6 |
     |--------------------------------------------|
 11. |  1    Baseline       2        8          5 |
 12. |  1    Baseline       2        8   4.571429 |
 13. |  1    Baseline       2        8   5.857143 |
 14. |  1    Baseline       2        8   4.571429 |
 15. |  1    Baseline       2        8   4.571429 |
 16. |  1    Baseline       2        8   4.857143 |
 17. |  1    Baseline       2        8   5.285714 |
 18. |  1    Baseline       2        8          3 |
     |--------------------------------------------|
 19. |  1   Follow up       2        7          6 |
 20. |  1   Follow up       2        7   6.285714 |
 21. |  1   Follow up       2        7   5.571429 |
 22. |  1   Follow up       2        7   6.142857 |
 23. |  1   Follow up       2        7          6 |
 24. |  1   Follow up       2        7   5.714286 |
 25. |  1   Follow up       2        7          6 |
     +--------------------------------------------+

Your help is highly appreciated and many thanks in advance.

Regards,

Last edited by Roman Mostazir; 03 Mar 2015, 14:29.

Roman

Tags: None

Carole J. Wilson

Join Date: Jan 2015
Posts: 932

03 Mar 2015, 19:21

Here's a possible solution. It uses -fillin- to add more observations. But, because it will add too many obs (equivalent to the largest group), we have to drop some in the end. We figure out how many we have to keep by comparing it to the new variable "max" that is the largest value within id/dayno. Values of id, pre, dayno, and max are filled in using David Kantor's program -carryforward- from http://fmwww.bc.edu/RePEc/bocode/c (findit carryforward).

I added a second value of id for proof of concept.

Code:

clear
input id     str18  pre   dayno   dcount    outcome
1    "Baseline"      1        4   3.571429
1    "Baseline"      1        4   4.571429
1    "Baseline"      1        4   2.857143
1    "Baseline"      1        4   3.857143
1    "Follow up"     1        6          6
1    "Follow up"     1        6          6
1    "Follow up"     1        6   5.714286
1    "Follow up"     1        6   5.714286
1    "Follow up"     1        6   5.857143
1    "Follow up"     1        6          6
1    "Baseline"      2        8          5
1    "Baseline"      2        8   4.571429
1    "Baseline"      2        8   5.857143
1    "Baseline"      2        8   4.571429
1    "Baseline"      2        8   4.571429
1    "Baseline"      2        8   4.857143
1    "Baseline"      2        8   5.285714
1    "Baseline"      2        8          3
1    "Follow up"     2        7          6
1    "Follow up"     2        7   6.285714
1    "Follow up"     2        7   5.571429
1    "Follow up"     2        7   6.142857
1    "Follow up"     2        7          6
1    "Follow up"     2        7   5.714286
1    "Follow up"     2        7          6
2    "Baseline"      1        4   3.571429
2    "Baseline"      1        4   4.571429
2    "Baseline"      1        4   2.857143
2    "Baseline"      1        4   3.857143
2    "Follow up"     1        6          6
2    "Follow up"     1        6          6
2    "Follow up"     1        6   5.714286
2    "Follow up"     1        6   5.714286
2    "Follow up"     1        6   5.857143
2    "Follow up"     1        6          6
2    "Baseline"      2        8          5
2    "Baseline"      2        8   4.571429
2    "Baseline"      2        8   5.857143
2    "Baseline"      2        8   4.571429
2    "Baseline"      2        8   4.571429
2    "Baseline"      2        8   4.857143
2    "Baseline"      2        8   5.285714
2    "Baseline"      2        8          3
2    "Follow up"     2        7          6
2    "Follow up"     2        7   6.285714
2    "Follow up"     2        7   5.571429
2    "Follow up"     2        7   6.142857
2    "Follow up"     2        7          6
2    "Follow up"     2        7   5.714286
2    "Follow up"     2        7          6
end


egen group=group(id dayno pre)
bysort group: gen count= _n
egen max=max(dcount), by(id dayno)
fillin group count
carryforward id pre dayno max, replace
drop if count>max
list, nol sepby(id  dayno pre )

Last edited by Carole J. Wilson; 03 Mar 2015, 19:24.

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1

Comment

Roman Mostazir

Join Date: Apr 2014

Posts: 874
#3

03 Mar 2015, 19:57

Dear Carole,

Many thanks. Your solution works, but did not solve the problem which is not the problem of your solution rather problem of the people who managed the dataset. Considering the nature of the study and the data, the analysis is going to be complex and I must say with such complexities, this is the worst maintained dataset I possibly ever had. Getting to the point, the reason for which the solution did not meet the expectation is that some of the id's are randomly missing at follow up !!! If the solution was perfect, the total number of 'id' should equal at Baseline and Follow up, but a frequency distribution (by pre) suggests they differ by an amount of 418. I already with my naked eye spotted couple of id's those are missing at follow up. I need to sort out first which id's are missing at follow up before I reapply your solution. But many thanks for your time.

All the best.

Roman
Comment

Carole J. Wilson

Join Date: Jan 2015
Posts: 932

03 Mar 2015, 20:50

Hi Roman,
I added some observations where Followup is missing or Baseline is missing. This code will check if there is both cases. If one set is missing, the missing cases will be added (outcome set to missing).

If the variable sumtag is 1, then something is missing.
-expand- duplicates those cases and changes the value of pre
The new observations created by expand have a value of 1 for the variable expand.

Hopefully this will reduce your efforts to search through the data.
--Carole

Code:

clear
input id     str18  pre   dayno   dcount    outcome
1    "Baseline"      1        4   3.571429
1    "Baseline"      1        4   4.571429
1    "Baseline"      1        4   2.857143
1    "Baseline"      1        4   3.857143
1    "Baseline"      2        8          5
1    "Baseline"      2        8   4.571429
1    "Baseline"      2        8   5.857143
1    "Baseline"      2        8   4.571429
1    "Baseline"      2        8   4.571429
1    "Baseline"      2        8   4.857143
1    "Baseline"      2        8   5.285714
1    "Baseline"      2        8          3
1    "Follow up"     2        7          6
1    "Follow up"     2        7   6.285714
1    "Follow up"     2        7   5.571429
1    "Follow up"     2        7   6.142857
1    "Follow up"     2        7          6
1    "Follow up"     2        7   5.714286
1    "Follow up"     2        7          6
2    "Baseline"      1        4   3.571429
2    "Baseline"      1        4   4.571429
2    "Baseline"      1        4   2.857143
2    "Baseline"      1        4   3.857143
2    "Follow up"     1        6          6
2    "Follow up"     1        6          6
2    "Follow up"     1        6   5.714286
2    "Follow up"     1        6   5.714286
2    "Follow up"     1        6   5.857143
2    "Follow up"     1        6          6
2    "Baseline"      2        8          5
2    "Baseline"      2        8   4.571429
2    "Baseline"      2        8   5.857143
2    "Baseline"      2        8   4.571429
2    "Baseline"      2        8   4.571429
2    "Baseline"      2        8   4.857143
2    "Baseline"      2        8   5.285714
2    "Baseline"      2        8          3
3    "Follow up"     2        7          6
3    "Follow up"     2        7   6.285714
3    "Follow up"     2        7   5.571429
3    "Follow up"     2        7   6.142857
3    "Follow up"     2        7          6
3    "Follow up"     2        7   5.714286
3    "Follow up"     2        7          6

end



egen tag=tag(id dayno pre)
egen sumtag=sum(tag), by(id dayno)
expand 2 if sumtag==1, gen(expand)
replace outcome=. if expand==1

gen type=.
replace type=1 if expand==1 & pre=="Follow up"
replace type=2 if expand==1 & pre=="Baseline"

*if Followup is missing
replace pre="Follow up" if type==2
*if Baseline is missing
replace pre="Baseline" if type==1




egen group=group(id dayno pre)
bysort group: gen count= _n
egen max=max(dcount), by(id dayno)
fillin group count
carryforward id pre dayno max, replace
drop if count>max
sort id dayno pre
list, nol sepby(id  dayno pre )

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

04 Mar 2015, 08:30

Using Carole's data setup, here's another approach.

Code:

clear
input id     str18  pre   dayno   dcount    outcome
1    "Baseline"      1        4   3.571429
1    "Baseline"      1        4   4.571429
1    "Baseline"      1        4   2.857143
1    "Baseline"      1        4   3.857143
1    "Baseline"      2        8          5
1    "Baseline"      2        8   4.571429
1    "Baseline"      2        8   5.857143
1    "Baseline"      2        8   4.571429
1    "Baseline"      2        8   4.571429
1    "Baseline"      2        8   4.857143
1    "Baseline"      2        8   5.285714
1    "Baseline"      2        8          3
1    "Follow up"     2        7          6
1    "Follow up"     2        7   6.285714
1    "Follow up"     2        7   5.571429
1    "Follow up"     2        7   6.142857
1    "Follow up"     2        7          6
1    "Follow up"     2        7   5.714286
1    "Follow up"     2        7          6
2    "Baseline"      1        4   3.571429
2    "Baseline"      1        4   4.571429
2    "Baseline"      1        4   2.857143
2    "Baseline"      1        4   3.857143
2    "Follow up"     1        6          6
2    "Follow up"     1        6          6
2    "Follow up"     1        6   5.714286
2    "Follow up"     1        6   5.714286
2    "Follow up"     1        6   5.857143
2    "Follow up"     1        6          6
2    "Baseline"      2        8          5
2    "Baseline"      2        8   4.571429
2    "Baseline"      2        8   5.857143
2    "Baseline"      2        8   4.571429
2    "Baseline"      2        8   4.571429
2    "Baseline"      2        8   4.857143
2    "Baseline"      2        8   5.285714
2    "Baseline"      2        8          3
3    "Follow up"     2        7          6
3    "Follow up"     2        7   6.285714
3    "Follow up"     2        7   5.571429
3    "Follow up"     2        7   6.142857
3    "Follow up"     2        7          6
3    "Follow up"     2        7   5.714286
3    "Follow up"     2        7          6
end

* retain the original order (in case it matters)
gen obs = _n
bysort id dayno pre (obs): replace obs = _n
tempfile main
save "`main'"

* Consistency check
bysort id dayno pre: assert dcount == _N

* rebuild the structure
collapse (max) dcount, by(id dayno)
gen pre = "Baseline"
expand 2
bysort id dayno: replace pre = "Follow up" if _n == 2
expand dcount
bysort id dayno pre: gen obs = _n

* merge the expected structure with the original data
merge 1:1 id dayno pre obs using "`main'", assert(master match) nogen

sort id dayno pre obs
list, sepby(id dayno pre)

Comment

Roman Mostazir

Join Date: Apr 2014

Posts: 874
#6

04 Mar 2015, 09:55

Thank you so much Carole, nearly matched everything. I had to sort out someother hidden dynamics (you don't want to hear all those fuss), but yes finally it is done. Thank you again for your time and effort.
All the best.

Roman
Comment
Roman Mostazir

Join Date: Apr 2014

Posts: 874
#7

04 Mar 2015, 11:18

Many thanks Robert, this looks really neat and less verbose approachand the idea of merging is even more interesting. Thanks for your help.

Roman
Comment

Announcement

Preparing dataset: Inserting rows

Comment

Comment

Comment

Comment

Comment

Comment