Merging long format datasets depending on observation date within ID

Anne Christensen

Join Date: Apr 2023

Posts: 28
#1

Merging long format datasets depending on observation date within ID

24 Apr 2023, 13:33

Hi,
I would like to know if there is an easy/easier and more systematic way of merging my datasets (m:m) so that outcomes are med not only within ID but also within dates - I hope this example helps:

I have several outcomes that I would like to have in the same dataset. They are all measured multiple times within ID, some every day, some once a week, some once a month.

An example (sorry I don't have access to my data right now, so can't use dataex)

Outcome 1:

ID outcome1 date1
1 56 07sep22
1 87 08sep22
1 76 09sep22
1 89 10sep22
1 63 11sep22
1 76 12 sep22
(etc...)
2 54 07sep22
2 76 08sep22

Outcome 2:
ID outcome2 date2
1 5 07sep22
1 6 12sep22
1 1 19sep22
1 3 26sep22
(etc...)
2 6 7sep22
2 9 12sep22

Is there a way in which I can merge on date within ID so to speak, so that it merges like this:
ID outome1 date1 outcome 2
1 56 07sep22 5
1 87 08sep22 .
1 76 09sep22 .
1 89 10sep22 .
1 63 11sep22 .
1 76 12 sep22 6
etc...
2 54 07sep22 6
2 76 08sep22 .

Currently I can only make them merge like this:
ID outcome1 date1 outcome 2 date2
1 56 07sep22 5 07sep22
1 87 08sep22 6 12sep22
1 76 09sep22 1 19sep22
1 89 10sep22 3 16sep22
1 63 11sep22 .
1 76 12 sep22 .
etc...
2 54 07sep22
2 76 08sep22

So essentially I would like for there to be a row for each observation of outcome1 and for the observations of outcome2 to be inserted on the corresponding date - and for outcome 2 to otherwise be missing.

A related question: I find it confusing when missing values are propagated (I think is the term) with the last observed value when merging m:m, eg if I have 50 observations for outcome1 ID 1 but only 30 on outcome2, stata will just 'repeat' the values of the last row of the outcome 2 dataset when I merge it with outcome1 - is there are way to avoid this? Or should they always be nonmissing when I need to do multilevel mixed-effects linear regression (mixed-command)?

Thank you very much in advance - if helpful I can try to produce a data example.

Last edited by Anne Christensen; 24 Apr 2023, 13:37.
Tags: dates, panel data, Time Series
Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#2

24 Apr 2023, 13:45

I would like to know if there is an easy/easier and more systematic way of merging my datasets (m:m)

Given the -merge m:m- is the worst possible way to combine two data sets, and in almost all circumstances produces data salad, yes, any other way would be better! Seriously, NEVER use -merge m:m-. Whenever you think you need it, either your data sets are wrong or you don't understand your data and are missing how to properly merge them (or both).

Code:

use dataset2, clear rename date2 date1 merge 1:1 ID date1 using dataset1 rename date1 date

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment

Anne Christensen

Join Date: Apr 2023
Posts: 28

25 Apr 2023, 01:17

Dear Clyde
Thank you for your reply.
I have seen this advice against m:m several times, but when I have long format with ID not uniquely identifying individuals in both datasets, I don't see how 1:1 could work?
The suggested code did not work.

I am now with my data and have here examples of both datasets using daatex:

Data 1

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str10 id float totalsleeptime long date1
""             .     .
"kRaGTz5jis" 340 22897
"kRaGTz5jis" 345 22898
"kRaGTz5jis" 480 22899
"kRaGTz5jis" 380 22900
"kRaGTz5jis" 335 22901
"kRaGTz5jis" 335 22902
"kRaGTz5jis" 370 22903
"kRaGTz5jis" 370 22904
"kRaGTz5jis" 240 22905
"kRaGTz5jis" 330 22906
"kRaGTz5jis" 335 22907
"kRaGTz5jis" 350 22908
"kRaGTz5jis" 335 22909
"kRaGTz5jis" 350 22910
"kRaGTz5jis" 350 22911
"kRaGTz5jis" 355 22912
"kRaGTz5jis" 365 22913
"kRaGTz5jis" 340 22914
"kRaGTz5jis" 350 22915
end
format %td date1

Data 2

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str10 id byte isiscore long date2
""            .     .
"kRaGTz5jis" 13 22896
"kRaGTz5jis" 13 22903
"kRaGTz5jis"  7 22910
"kRaGTz5jis"  6 22917
"kRaGTz5jis"  7 22924
"kRaGTz5jis"  5 22931
"kRaGTz5jis"  6 22938
"kRaGTz5jis"  8 22945
"kRaGTz5jis"  2 22952
"kRaGTz5jis"  2 22959
"kRaGTz5jis"  2 22966
"kRaGTz5jis"  1 22973
"kRaGTz5jis"  0 22980
"kRaGTz5jis"  0 22987
"kRaGTz5jis"  2 22994
"kRaGTz5jis"  0 23002
"kRaGTz5jis"  2 23009
"kRaGTz5jis"  0 23018
"kRaGTz5jis"  0 23026
end
format %td date2

Thank you,
Anne

Comment

Anne Christensen

Join Date: Apr 2023
Posts: 28

25 Apr 2023, 01:34

Update: I think merge
m:m email date1 using "date2", generate(date1_merge)
has partly done the trick, as this produces:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str10 id float totalsleeptime byte isiscore long date1
""             .  .     .
"kRaGTz5jis" 340  . 22897
"kRaGTz5jis" 345  . 22898
"kRaGTz5jis" 480  . 22899
"kRaGTz5jis" 380  . 22900
"kRaGTz5jis" 335  . 22901
"kRaGTz5jis" 335  . 22902
"kRaGTz5jis" 370 13 22903
"kRaGTz5jis" 370  . 22904
"kRaGTz5jis" 240  . 22905
"kRaGTz5jis" 330  . 22906
"kRaGTz5jis" 335  . 22907
"kRaGTz5jis" 350  . 22908
"kRaGTz5jis" 335  . 22909
"kRaGTz5jis" 350  7 22910
"kRaGTz5jis" 350  . 22911
"kRaGTz5jis" 355  . 22912
"kRaGTz5jis" 365  . 22913
"kRaGTz5jis" 340  . 22914
"kRaGTz5jis" 350  . 22915
end
format %td date1

However, I see that IDs are not merged if they do not have observations regsitered on the same date - which makes sense of course since I am also merging on date.

Will leave this open to hear if anyone has a suggestion how to solve this.

Last edited by Anne Christensen; 25 Apr 2023, 01:50.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30174

25 Apr 2023, 13:19

What do you mean when you say that the code in #2 "did not work." It runs on the example data you showed:

Code:

. * Example generated by -dataex-. For more info, type help dataex
. clear

. input str10 id float totalsleeptime long date1

             id  totalsl~e         date1
  1. ""             .     .
  2. "kRaGTz5jis" 340 22897
  3. "kRaGTz5jis" 345 22898
  4. "kRaGTz5jis" 480 22899
  5. "kRaGTz5jis" 380 22900
  6. "kRaGTz5jis" 335 22901
  7. "kRaGTz5jis" 335 22902
  8. "kRaGTz5jis" 370 22903
  9. "kRaGTz5jis" 370 22904
 10. "kRaGTz5jis" 240 22905
 11. "kRaGTz5jis" 330 22906
 12. "kRaGTz5jis" 335 22907
 13. "kRaGTz5jis" 350 22908
 14. "kRaGTz5jis" 335 22909
 15. "kRaGTz5jis" 350 22910
 16. "kRaGTz5jis" 350 22911
 17. "kRaGTz5jis" 355 22912
 18. "kRaGTz5jis" 365 22913
 19. "kRaGTz5jis" 340 22914
 20. "kRaGTz5jis" 350 22915
 21. end

. format %td date1

. tempfile dataset1

. save `dataset1'
file C:\Users\clyde\AppData\Local\Temp\ST_7c88_000001.tmp saved as .dta format

.
. * Example generated by -dataex-. For more info, type help dataex
. clear

. input str10 id byte isiscore long date2

             id  isiscore         date2
  1. ""            .     .
  2. "kRaGTz5jis" 13 22896
  3. "kRaGTz5jis" 13 22903
  4. "kRaGTz5jis"  7 22910
  5. "kRaGTz5jis"  6 22917
  6. "kRaGTz5jis"  7 22924
  7. "kRaGTz5jis"  5 22931
  8. "kRaGTz5jis"  6 22938
  9. "kRaGTz5jis"  8 22945
 10. "kRaGTz5jis"  2 22952
 11. "kRaGTz5jis"  2 22959
 12. "kRaGTz5jis"  2 22966
 13. "kRaGTz5jis"  1 22973
 14. "kRaGTz5jis"  0 22980
 15. "kRaGTz5jis"  0 22987
 16. "kRaGTz5jis"  2 22994
 17. "kRaGTz5jis"  0 23002
 18. "kRaGTz5jis"  2 23009
 19. "kRaGTz5jis"  0 23018
 20. "kRaGTz5jis"  0 23026
 21. end

. format %td date2

. tempfile dataset2

. save `dataset2'
file C:\Users\clyde\AppData\Local\Temp\ST_7c88_000002.tmp saved as .dta format

.
. use `dataset2', clear

. rename date2 date1

. merge 1:1 id date1 using `dataset1'

    Result                      Number of obs
    -----------------------------------------
    Not matched                            34
        from master                        17  (_merge==1)
        from using                         17  (_merge==2)

    Matched                                 3  (_merge==3)
    -----------------------------------------

. rename date1 date

and correctly combines each observation from the first data set with any corresponding observation in the other having the same id and date.

So what is the problem? What about the results it produces is not what you want? Based on what you showed in #1, this looks to be precisely what you are asking for. What am I missing?

Regarding your update, I cannot comment on -merge m:m email date1 using "date2", generate(date1_merge)- because there is no variable email in your example data sets, so I don't know what that is doing. Regardless, I am extremely confident that -merge m:m- is not the right way to go. Since you seem to like the output that you are showing in #4, I will point out that you can get that same result correctly with:

Code:

use `dataset2', clear
rename date2 date1
merge 1:1 id date1 using `dataset1', keep(match using)

Added: Note that in your -merge m:m- command in #4 you are using two merge key variables, not one. I'm guessing that email is an id variable, and probably corresponds exactly to it. When a 1:1 merge is possible on a set of key variables (as here), -merge m:m- produces the same thing as -merge 1:1-. However, when you -merge m:m- using a set of merge key variables for which 1:1, 1:m or m:1 merging is not possible, -merge m:m- produces garbage. Don't use m:m, seriously! It will get you into trouble--it's only a matter of when, not if.

Last edited by Clyde Schechter; 25 Apr 2023, 13:26.

Comment

Anne Christensen

Join Date: Apr 2023

Posts: 28
#6

26 Apr 2023, 06:03

Dear Clyde
I am sorry for having been imprecise, I meant that stata produced the error message that merge 1:1 was not possible since the variables did not uniquely identify in my using data - I realize now it is because some IDs have multiple observations on the same date that I will need to handle somehow.

From what you show it does indeed produce the results I was trying to get, so I will try to adjust my data to that end.

In my example of a 'working' m:m code, sorry I put email instead of id, both are id variables.
Thank you again, I will do my best to avoid m:m!
Comment

Announcement