Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging long format datasets depending on observation date within ID

    Hi,
    I would like to know if there is an easy/easier and more systematic way of merging my datasets (m:m) so that outcomes are med not only within ID but also within dates - I hope this example helps:

    I have several outcomes that I would like to have in the same dataset. They are all measured multiple times within ID, some every day, some once a week, some once a month.

    An example (sorry I don't have access to my data right now, so can't use dataex)

    Outcome 1:

    ID outcome1 date1
    1 56 07sep22
    1 87 08sep22
    1 76 09sep22
    1 89 10sep22
    1 63 11sep22
    1 76 12 sep22
    (etc...)
    2 54 07sep22
    2 76 08sep22

    Outcome 2:
    ID outcome2 date2
    1 5 07sep22
    1 6 12sep22
    1 1 19sep22
    1 3 26sep22
    (etc...)
    2 6 7sep22
    2 9 12sep22


    Is there a way in which I can merge on date within ID so to speak, so that it merges like this:
    ID outome1 date1 outcome 2
    1 56 07sep22 5
    1 87 08sep22 .
    1 76 09sep22 .
    1 89 10sep22 .
    1 63 11sep22 .
    1 76 12 sep22 6
    etc...
    2 54 07sep22 6
    2 76 08sep22 .


    Currently I can only make them merge like this:
    ID outcome1 date1 outcome 2 date2
    1 56 07sep22 5 07sep22
    1 87 08sep22 6 12sep22
    1 76 09sep22 1 19sep22
    1 89 10sep22 3 16sep22
    1 63 11sep22 .
    1 76 12 sep22 .
    etc...
    2 54 07sep22
    2 76 08sep22

    So essentially I would like for there to be a row for each observation of outcome1 and for the observations of outcome2 to be inserted on the corresponding date - and for outcome 2 to otherwise be missing.

    A related question: I find it confusing when missing values are propagated (I think is the term) with the last observed value when merging m:m, eg if I have 50 observations for outcome1 ID 1 but only 30 on outcome2, stata will just 'repeat' the values of the last row of the outcome 2 dataset when I merge it with outcome1 - is there are way to avoid this? Or should they always be nonmissing when I need to do multilevel mixed-effects linear regression (mixed-command)?

    Thank you very much in advance - if helpful I can try to produce a data example.
    Last edited by Anne Christensen; 24 Apr 2023, 13:37.

  • #2
    I would like to know if there is an easy/easier and more systematic way of merging my datasets (m:m)
    Given the -merge m:m- is the worst possible way to combine two data sets, and in almost all circumstances produces data salad, yes, any other way would be better! Seriously, NEVER use -merge m:m-. Whenever you think you need it, either your data sets are wrong or you don't understand your data and are missing how to properly merge them (or both).

    Code:
    use dataset2, clear
    rename date2 date1
    merge 1:1 ID date1 using dataset1
    rename date1 date
    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Comment


    • #3
      Dear Clyde
      Thank you for your reply.
      I have seen this advice against m:m several times, but when I have long format with ID not uniquely identifying individuals in both datasets, I don't see how 1:1 could work?
      The suggested code did not work.

      I am now with my data and have here examples of both datasets using daatex:

      Data 1

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str10 id float totalsleeptime long date1
      ""             .     .
      "kRaGTz5jis" 340 22897
      "kRaGTz5jis" 345 22898
      "kRaGTz5jis" 480 22899
      "kRaGTz5jis" 380 22900
      "kRaGTz5jis" 335 22901
      "kRaGTz5jis" 335 22902
      "kRaGTz5jis" 370 22903
      "kRaGTz5jis" 370 22904
      "kRaGTz5jis" 240 22905
      "kRaGTz5jis" 330 22906
      "kRaGTz5jis" 335 22907
      "kRaGTz5jis" 350 22908
      "kRaGTz5jis" 335 22909
      "kRaGTz5jis" 350 22910
      "kRaGTz5jis" 350 22911
      "kRaGTz5jis" 355 22912
      "kRaGTz5jis" 365 22913
      "kRaGTz5jis" 340 22914
      "kRaGTz5jis" 350 22915
      end
      format %td date1
      Data 2

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str10 id byte isiscore long date2
      ""            .     .
      "kRaGTz5jis" 13 22896
      "kRaGTz5jis" 13 22903
      "kRaGTz5jis"  7 22910
      "kRaGTz5jis"  6 22917
      "kRaGTz5jis"  7 22924
      "kRaGTz5jis"  5 22931
      "kRaGTz5jis"  6 22938
      "kRaGTz5jis"  8 22945
      "kRaGTz5jis"  2 22952
      "kRaGTz5jis"  2 22959
      "kRaGTz5jis"  2 22966
      "kRaGTz5jis"  1 22973
      "kRaGTz5jis"  0 22980
      "kRaGTz5jis"  0 22987
      "kRaGTz5jis"  2 22994
      "kRaGTz5jis"  0 23002
      "kRaGTz5jis"  2 23009
      "kRaGTz5jis"  0 23018
      "kRaGTz5jis"  0 23026
      end
      format %td date2
      Thank you,
      Anne

      Comment


      • #4
        Update: I think merge
        m:m email date1 using "date2", generate(date1_merge)
        has partly done the trick, as this produces:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str10 id float totalsleeptime byte isiscore long date1
        ""             .  .     .
        "kRaGTz5jis" 340  . 22897
        "kRaGTz5jis" 345  . 22898
        "kRaGTz5jis" 480  . 22899
        "kRaGTz5jis" 380  . 22900
        "kRaGTz5jis" 335  . 22901
        "kRaGTz5jis" 335  . 22902
        "kRaGTz5jis" 370 13 22903
        "kRaGTz5jis" 370  . 22904
        "kRaGTz5jis" 240  . 22905
        "kRaGTz5jis" 330  . 22906
        "kRaGTz5jis" 335  . 22907
        "kRaGTz5jis" 350  . 22908
        "kRaGTz5jis" 335  . 22909
        "kRaGTz5jis" 350  7 22910
        "kRaGTz5jis" 350  . 22911
        "kRaGTz5jis" 355  . 22912
        "kRaGTz5jis" 365  . 22913
        "kRaGTz5jis" 340  . 22914
        "kRaGTz5jis" 350  . 22915
        end
        format %td date1

        However, I see that IDs are not merged if they do not have observations regsitered on the same date - which makes sense of course since I am also merging on date.

        Will leave this open to hear if anyone has a suggestion how to solve this.
        Last edited by Anne Christensen; 25 Apr 2023, 01:50.

        Comment


        • #5
          What do you mean when you say that the code in #2 "did not work." It runs on the example data you showed:
          Code:
          . * Example generated by -dataex-. For more info, type help dataex
          . clear
          
          . input str10 id float totalsleeptime long date1
          
                       id  totalsl~e         date1
            1. ""             .     .
            2. "kRaGTz5jis" 340 22897
            3. "kRaGTz5jis" 345 22898
            4. "kRaGTz5jis" 480 22899
            5. "kRaGTz5jis" 380 22900
            6. "kRaGTz5jis" 335 22901
            7. "kRaGTz5jis" 335 22902
            8. "kRaGTz5jis" 370 22903
            9. "kRaGTz5jis" 370 22904
           10. "kRaGTz5jis" 240 22905
           11. "kRaGTz5jis" 330 22906
           12. "kRaGTz5jis" 335 22907
           13. "kRaGTz5jis" 350 22908
           14. "kRaGTz5jis" 335 22909
           15. "kRaGTz5jis" 350 22910
           16. "kRaGTz5jis" 350 22911
           17. "kRaGTz5jis" 355 22912
           18. "kRaGTz5jis" 365 22913
           19. "kRaGTz5jis" 340 22914
           20. "kRaGTz5jis" 350 22915
           21. end
          
          . format %td date1
          
          . tempfile dataset1
          
          . save `dataset1'
          file C:\Users\clyde\AppData\Local\Temp\ST_7c88_000001.tmp saved as .dta format
          
          .
          . * Example generated by -dataex-. For more info, type help dataex
          . clear
          
          . input str10 id byte isiscore long date2
          
                       id  isiscore         date2
            1. ""            .     .
            2. "kRaGTz5jis" 13 22896
            3. "kRaGTz5jis" 13 22903
            4. "kRaGTz5jis"  7 22910
            5. "kRaGTz5jis"  6 22917
            6. "kRaGTz5jis"  7 22924
            7. "kRaGTz5jis"  5 22931
            8. "kRaGTz5jis"  6 22938
            9. "kRaGTz5jis"  8 22945
           10. "kRaGTz5jis"  2 22952
           11. "kRaGTz5jis"  2 22959
           12. "kRaGTz5jis"  2 22966
           13. "kRaGTz5jis"  1 22973
           14. "kRaGTz5jis"  0 22980
           15. "kRaGTz5jis"  0 22987
           16. "kRaGTz5jis"  2 22994
           17. "kRaGTz5jis"  0 23002
           18. "kRaGTz5jis"  2 23009
           19. "kRaGTz5jis"  0 23018
           20. "kRaGTz5jis"  0 23026
           21. end
          
          . format %td date2
          
          . tempfile dataset2
          
          . save `dataset2'
          file C:\Users\clyde\AppData\Local\Temp\ST_7c88_000002.tmp saved as .dta format
          
          .
          . use `dataset2', clear
          
          . rename date2 date1
          
          . merge 1:1 id date1 using `dataset1'
          
              Result                      Number of obs
              -----------------------------------------
              Not matched                            34
                  from master                        17  (_merge==1)
                  from using                         17  (_merge==2)
          
              Matched                                 3  (_merge==3)
              -----------------------------------------
          
          . rename date1 date
          and correctly combines each observation from the first data set with any corresponding observation in the other having the same id and date.

          So what is the problem? What about the results it produces is not what you want? Based on what you showed in #1, this looks to be precisely what you are asking for. What am I missing?

          Regarding your update, I cannot comment on -merge m:m email date1 using "date2", generate(date1_merge)- because there is no variable email in your example data sets, so I don't know what that is doing. Regardless, I am extremely confident that -merge m:m- is not the right way to go. Since you seem to like the output that you are showing in #4, I will point out that you can get that same result correctly with:
          Code:
          use `dataset2', clear
          rename date2 date1
          merge 1:1 id date1 using `dataset1', keep(match using)
          Added: Note that in your -merge m:m- command in #4 you are using two merge key variables, not one. I'm guessing that email is an id variable, and probably corresponds exactly to it. When a 1:1 merge is possible on a set of key variables (as here), -merge m:m- produces the same thing as -merge 1:1-. However, when you -merge m:m- using a set of merge key variables for which 1:1, 1:m or m:1 merging is not possible, -merge m:m- produces garbage. Don't use m:m, seriously! It will get you into trouble--it's only a matter of when, not if.

          Last edited by Clyde Schechter; 25 Apr 2023, 13:26.

          Comment


          • #6
            Dear Clyde
            I am sorry for having been imprecise, I meant that stata produced the error message that merge 1:1 was not possible since the variables did not uniquely identify in my using data - I realize now it is because some IDs have multiple observations on the same date that I will need to handle somehow.

            From what you show it does indeed produce the results I was trying to get, so I will try to adjust my data to that end.

            In my example of a 'working' m:m code, sorry I put email instead of id, both are id variables.
            Thank you again, I will do my best to avoid m:m!

            Comment

            Working...
            X