Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Joinby-Stata freezes

    Dear Statalisters,

    I have the following two datasets (Datase-1 and Dataset2). Dataset-1 has arround 5million observations for repeated ids and dates (qdate). Dataset-2 has common ids but different dates (sdate) with 1million observations. Being aware of the fact that m:m is a bad thing to do, I was hoping to use 'joinby' having dataset-1 as master and 2 as 'using' and bring 'sdate' from dataset-2 to dataset-1. My Stata freezes. I will highly appreciate any suggestion on how to merge these two datasets.

    In a vain attempt, I splitted the dataset-1 as 4 parts with rougly equal observations (±1milliion) but still Stata freezes. Please see below the dataset structures. In the examples below though I presented on date for brevity, the dates will vary within person in both datasets.

    Dataset1:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double id int qdate
    10013930 22492
    10013930 22492
    10013930 22492
    10013930 22492
    10013930 22492
    10013930 22492
    10013930 22492
    10013930 22492
    10013930 22492
    10013930 22492
    10013930 22492
    10013930 22492
    end
    format %td qdate
    Dataset2:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long id int sdate
    10013930 22292
    10013930 22292
    10013930 22292
    10013930 22292
    10013930 22292
    10013930 22292
    10013930 22292
    10013930 22292
    10013930 22292
    10013930 22292
    10013930 22292
    10013930 22292
    end
    format %td sdate
    //Joinby command:

    Code:
    joinby id using data2.dta, unmatched(both)
    Stata version: Version 17.0

    System info:

    Code:
    Hardware Overview:
      Model Name:    MacBook Air
      Chip:    Apple M2
      Total Number of Cores:    8 (4 performance and 4 efficiency)
      Memory:    8 GB
      System Firmware Version:    8419.80.7
      OS Loader Version:    8419.80.7
    
    OS: Ventura 13.2.1
    Last edited by Roman Mostazir; 04 Mar 2023, 07:07.
    Roman

  • #2
    If you joinby your examples, you will go from a dataset with 12 observations to 144 (12x12) observations. It can blow up pretty quickly (you should estimate how many records you will end up with by doing the joinby).

    Do you really need to duplicate all records for an individual in your second dataset for each of that individual's record in the first dataset? If you describe how you will be using your data, we could suggest better options than joinby.

    Comment


    • #3
      Hi Daniel,
      Thanks for the reply. No I don't wish to duplicate all records from the 2nd dataset to the first but I wish to bring the 'sdate' from 2nd dataset to the 1st dataset based on id match. But since, IDs are repeated in both datasets and m:m is treacherous, is there any other technique to merge them?
      Roman

      Comment


      • #4
        In your example, sdate is the same for all observations for a given ID. Is this this the case? If sdate is all you care about, reduce your second dataset such that you remove all duplicate sdates. In your example, this would not lead to an increase in the number of observations.

        Comment


        • #5
          Hi Daniel,
          I would have, if I could. My #1 post says that the dates vary within id in both datasets. For example see below the 'sdate' for another id:

          Code:
              id    sdate  
              10007820   23jan2021  
              10007820   23jan2021  
              10007820   23jan2021  
              10007820   23jan2021  
              10007820   23jan2021  
              10007820   23jan2021  
              10007820   23jan2021  
              10007820   23jan2021  
              10007820   23jan2021  
              10007820   23jan2021  
              10007820   23jan2021  
              10007820   24jan2021  
              10007820   24jan2021  
              10007820   24jan2021  
              10007820   24jan2021  
              10007820   24jan2021

          Roman

          Comment


          • #6
            If you only need sdate and do not need all the duplicates, perform the following in data2:

            Code:
            use data2, clear
            duplicates drop
            save data2_nodups
            
            use data1, clear
            joinby id using data2, unmatched(both)
            I still feel this might not be the approach, but can't read your intentions with the dataset.

            Comment


            • #7
              #6 Will delete duplicate dates when patients used different therapies recorded within each day. I thought of this option too but not a viable option in this case.

              Thanks Daniel for your help. This dataset has gone through various complex phases of data collections. I probably need to think differently hereby I am happy to close the thread.

              Thanks again.

              Thread closed.
              Roman

              Comment

              Working...
              X