Joinby-Stata freezes

Roman Mostazir

Join Date: Apr 2014

Posts: 877
#1

Joinby-Stata freezes

04 Mar 2023, 06:59

Dear Statalisters,

I have the following two datasets (Datase-1 and Dataset2). Dataset-1 has arround 5million observations for repeated ids and dates (qdate). Dataset-2 has common ids but different dates (sdate) with 1million observations. Being aware of the fact that m:m is a bad thing to do, I was hoping to use 'joinby' having dataset-1 as master and 2 as 'using' and bring 'sdate' from dataset-2 to dataset-1. My Stata freezes. I will highly appreciate any suggestion on how to merge these two datasets.

In a vain attempt, I splitted the dataset-1 as 4 parts with rougly equal observations (±1milliion) but still Stata freezes. Please see below the dataset structures. In the examples below though I presented on date for brevity, the dates will vary within person in both datasets.

Dataset1:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input double id int qdate 10013930 22492 10013930 22492 10013930 22492 10013930 22492 10013930 22492 10013930 22492 10013930 22492 10013930 22492 10013930 22492 10013930 22492 10013930 22492 10013930 22492 end format %td qdate

Dataset2:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long id int sdate 10013930 22292 10013930 22292 10013930 22292 10013930 22292 10013930 22292 10013930 22292 10013930 22292 10013930 22292 10013930 22292 10013930 22292 10013930 22292 10013930 22292 end format %td sdate

//Joinby command:

Code:

joinby id using data2.dta, unmatched(both)

Stata version: Version 17.0

System info:

Code:

Hardware Overview: Model Name: MacBook Air Chip: Apple M2 Total Number of Cores: 8 (4 performance and 4 efficiency) Memory: 8 GB System Firmware Version: 8419.80.7 OS Loader Version: 8419.80.7 OS: Ventura 13.2.1

Last edited by Roman Mostazir; 04 Mar 2023, 07:07.

Roman
Tags: None
Daniel Shin

Join Date: Mar 2020

Posts: 146
#2

04 Mar 2023, 10:57

If you joinby your examples, you will go from a dataset with 12 observations to 144 (12x12) observations. It can blow up pretty quickly (you should estimate how many records you will end up with by doing the joinby).

Do you really need to duplicate all records for an individual in your second dataset for each of that individual's record in the first dataset? If you describe how you will be using your data, we could suggest better options than joinby.
Comment
Roman Mostazir

Join Date: Apr 2014

Posts: 877
#3

04 Mar 2023, 11:19

Hi Daniel,
Thanks for the reply. No I don't wish to duplicate all records from the 2nd dataset to the first but I wish to bring the 'sdate' from 2nd dataset to the 1st dataset based on id match. But since, IDs are repeated in both datasets and m:m is treacherous, is there any other technique to merge them?

Roman
Comment
Daniel Shin

Join Date: Mar 2020

Posts: 146
#4

04 Mar 2023, 11:24

In your example, sdate is the same for all observations for a given ID. Is this this the case? If sdate is all you care about, reduce your second dataset such that you remove all duplicate sdates. In your example, this would not lead to an increase in the number of observations.
1 like
Comment

Roman Mostazir

Join Date: Apr 2014
Posts: 877

04 Mar 2023, 11:33

Hi Daniel,
I would have, if I could. My #1 post says that the dates vary within id in both datasets. For example see below the 'sdate' for another id:

Code:

    id    sdate  
    10007820   23jan2021  
    10007820   23jan2021  
    10007820   23jan2021  
    10007820   23jan2021  
    10007820   23jan2021  
    10007820   23jan2021  
    10007820   23jan2021  
    10007820   23jan2021  
    10007820   23jan2021  
    10007820   23jan2021  
    10007820   23jan2021  
    10007820   24jan2021  
    10007820   24jan2021  
    10007820   24jan2021  
    10007820   24jan2021  
    10007820   24jan2021

Roman

Comment

Daniel Shin

Join Date: Mar 2020

Posts: 146
#6

04 Mar 2023, 11:48

If you only need sdate and do not need all the duplicates, perform the following in data2:

Code:

use data2, clear duplicates drop save data2_nodups use data1, clear joinby id using data2, unmatched(both)

I still feel this might not be the approach, but can't read your intentions with the dataset.
1 like
Comment
Roman Mostazir

Join Date: Apr 2014

Posts: 877
#7

04 Mar 2023, 11:58

#6 Will delete duplicate dates when patients used different therapies recorded within each day. I thought of this option too but not a viable option in this case.

Thanks Daniel for your help. This dataset has gone through various complex phases of data collections. I probably need to think differently hereby I am happy to close the thread.

Thanks again.

Thread closed.

Roman
Comment

Announcement

Joinby-Stata freezes

Comment

Comment

Comment

Comment

Comment

Comment