Brain-teasing problem of creating a long chain using binary relationship data

James Park

Join Date: May 2017

Posts: 97
#1

Brain-teasing problem of creating a long chain using binary relationship data

06 Nov 2023, 09:56

I have this data on the start and end month of Giver-Receiver relationship.

Giver Receiver Start End
A B 201807 201808
A B 202001 202003
B C 201902 201903
B D 202002 202004

I want to create a file on the duration of longer "chain" relationships.

The chain from A to B to D has an overlapped time from 202003 to 202112.

But chain from A to B to C does not have any overlapped time, so it won't create any line in the output.

The desired output file is either

ID1 ID2 ID3 Start End
A B D 202002 202003

or

ID1 ID2 ID3 Month
A B D 202002
A B D 202003

Either one form is fine. Both are hard.

I actually want to create any 4-firm or 5-firm or any length of longer chains, but creating 3-firm chain is already hard enough, so I would like begin with creating 3-firm chain.

This seems to be a hard problem on Stata or on any language.

One idea I tried to pursue was to create a data like this (I don't know how to):

Giver Receiver Month
A B 201807
A B 201808
A B 202001
A B 202002
A B 202003
B C 201902
B C 201903
B D 202002
B D 202003
B D 202004

and from this I construct my goal, but again I can't figure out how to do this, either. Just an idea.

What command or trick will likely to help me?

Code:

gen Giver="" gen Receiver="" gen Start="" gen End="" set obs 4 replace Giver="A" if _n==1 | _n==2 replace Receiver="B" if _n==1 | _n==2 replace Giver="A" if _n==3 | _n==4 replace Receiver="C" if _n==3 replace Receiver="D" if _n==4 replace Start="201807" if _n==1 replace Start="202001" if _n==2 replace Start="201902" if _n==3 replace Start="202002" if _n==4 replace End="201808" if _n==1 replace End="202003" if _n==2 replace End="201903" if _n==3 replace End="202004" if _n==4
Tags: data cleaning, data restructuring
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

06 Nov 2023, 11:13

Well, at least for the length 3 chains it isn't really very hard.

First, although the code would actually work without doing this, it is better when working with dates in Stata to use real Stata internal format date variables, not numbers that read as dates to the human eye but would potentially be mishandled by Stata in calculations. (In this case, these non-Stata-ish date variables sort correctly, so this particular code, which depends only on the order of the dates, would work anyway.)

Note: "The chain from A to B to D has an overlapped time from 202003 to 202112" is not correct, and contradicts the correct results you later say you desire:

ID1 ID2 ID3 Start End
A B D 202002 202003

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str1(giver receiver) long(start end) "A" "B" 201807 201808 "A" "B" 202001 202003 "B" "C" 201902 201903 "B" "D" 202002 202004 end // CONVERT START AND END DATES TO STATA INTERNAL FORMAT MONTHLY DATES foreach v of varlist start end { gen _`v' = ym(floor(`v'/100), mod(`v', 100)) assert missing(`v') == missing(_`v') format _`v' %tm drop `v' rename _`v' `v' } // PAIR EACH OBSERVATION WITH A POTENTIAL FOLLOW-ON IN THE CHAIN: // I.E. RECEIVER1 = GIVER2 preserve rename receiver receiver2 rename giver receiver rename (start end) =2 tempfile to_join save `to_join' restore joinby receiver using `to_join' rename (start end) =1 // NOW RESTRICT TO THOSE WHERE THE PERIODS IN QUESTION OVERLAP keep if max(start1, start2) < min(end1, end2) gen start = max(start1, start2) gen end = min(end1, end2) format start end %tm drop start1 start2 end1 end2

In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
2 likes
Comment
James Park

Join Date: May 2017

Posts: 97
#3

06 Nov 2023, 20:55

Originally posted by Clyde Schechter View Post

Well, at least for the length 3 chains it isn't really very hard.

First, although the code would actually work without doing this, it is better when working with dates in Stata to use real Stata internal format date variables, not numbers that read as dates to the human eye but would potentially be mishandled by Stata in calculations. (In this case, these non-Stata-ish date variables sort correctly, so this particular code, which depends only on the order of the dates, would work anyway.)

Note: "The chain from A to B to D has an overlapped time from 202003 to 202112" is not correct, and contradicts the correct results you later say you desire:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str1(giver receiver) long(start end) "A" "B" 201807 201808 "A" "B" 202001 202003 "B" "C" 201902 201903 "B" "D" 202002 202004 end // CONVERT START AND END DATES TO STATA INTERNAL FORMAT MONTHLY DATES foreach v of varlist start end { gen _`v' = ym(floor(`v'/100), mod(`v', 100)) assert missing(`v') == missing(_`v') format _`v' %tm drop `v' rename _`v' `v' } // PAIR EACH OBSERVATION WITH A POTENTIAL FOLLOW-ON IN THE CHAIN: // I.E. RECEIVER1 = GIVER2 preserve rename receiver receiver2 rename giver receiver rename (start end) =2 tempfile to_join save `to_join' restore joinby receiver using `to_join' rename (start end) =1 // NOW RESTRICT TO THOSE WHERE THE PERIODS IN QUESTION OVERLAP keep if max(start1, start2) < min(end1, end2) gen start = max(start1, start2) gen end = min(end1, end2) format start end %tm drop start1 start2 end1 end2

In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.

This is a great solution. Thank you. I could also extend it to create maximum possible chain length.

This code works for small data.

But it appears that joinby takes a huge memory on a large enough data.

Stata says

Code:

attempt to use too much memory You just attempted to use more memory than this Stata can currently accommodate. The maximum is 1600g bytes. That maximum is probably more than your computer can provide. You can, however, increase the maximum by increasing segmentsize, which is currently 32m; see help memory. Double segment size, and you double the maximum.

Perhaps dividing the tempfile into 10 chunks will do the trick?

Last edited by James Park; 06 Nov 2023, 21:02.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#4

06 Nov 2023, 21:13

This code works for small data.

But it appears that joinby takes a huge memory on a large enough data.

That is correct. -joinby- pairs up every observation in `to_join' with every observation in the original data set that has the same value of variable receiver. So, yes this can easily break through the limits of memory. But this is a problem with the intermediate result only. The end-result, which has weeded out those pairs where there is no overlap in the intervals, will be smaller, possibly much smaller.

There is a possible workaround for this. For any given receiver, the observations involving other receivers are partly irrelevant. So, you can run this code for one receiver (from the original data) at a time, pair them up with observations having the same receiver (which was giver in the original data set) in `to_join', save those results, and then append the results all together. This approach avoids creating a gargantuan intermediate data set. Robert Picard and I wrote -runby- several years ago, which automates this, and also leverages some efficiencies that come from the use of Mata. -runby- is available from SSC. The code would look like this (changes in bold face):

Code:

// CONVERT START AND END DATES TO STATA INTERNAL FORMAT MONTHLY DATES foreach v of varlist start end { gen _`v' = ym(floor(`v'/100), mod(`v', 100)) assert missing(`v') == missing(_`v') format _`v' %tm drop `v' rename _`v' `v' } // PAIR EACH OBSERVATION WITH A POTENTIAL FOLLOW-ON IN THE CHAIN: // I.E. RECEIVER1 = GIVER2 preserve rename receiver receiver2 rename giver receiver rename (start end) =2 tempfile to_join sort receiver save `to_join' restore gen join_file = "`to_join'" capture program drop one_receiver program define one_receiver local to_join = join_file[1] joinby receiver using `to_join' rename (start end) =1 // NOW RESTRICT TO THOSE WHERE THE PERIODS IN QUESTION OVERLAP keep if max(start1, start2) < min(end1, end2) gen start = max(start1, start2) gen end = min(end1, end2) format start end %tm drop start1 start2 end1 end2 join_file exit end runby one_receiver, by(receiver) status

Since your file is large enough to blow through memory limits, I imagine that the original code would also take a very long time to run. -runby- will speed things up a bit, and it will also (because of the -status- option) provide an updated progress report showing how many receivers have been processed so far and an estimate of the remaining time.

Now, there is no guarantee that this will work. It depends on the number of receivers and how many observations each has. If, as a worst case scenario, your data set was many millions of observations but all had the same receiver, the memory situation would be no different from before. But if you have a large number of different receivers, each with moderate numbers of observations, this will work just fine. It's at least worth a try.

Last edited by Clyde Schechter; 06 Nov 2023, 21:27.
1 like
Comment
James Park

Join Date: May 2017

Posts: 97
#5

06 Nov 2023, 22:14

Thank you so much. I actually found a problem in my attempt at extending it to more than 3 chains (4,5,6... chains). In my initial attempt, when doing joinby, the master data was the combined 3-chain and 2-chains, so that blew up the size of data. But then after making the master data to be just 3-chains and using data to be 2-chains, then it works fine.
Comment

Announcement

Brain-teasing problem of creating a long chain using binary relationship data

Comment

Comment

Comment

Comment