Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Brain-teasing problem of creating a long chain using binary relationship data

    I have this data on the start and end month of Giver-Receiver relationship.

    Giver Receiver Start End
    A B 201807 201808
    A B 202001 202003
    B C 201902 201903
    B D 202002 202004

    I want to create a file on the duration of longer "chain" relationships.

    The chain from A to B to D has an overlapped time from 202003 to 202112.

    But chain from A to B to C does not have any overlapped time, so it won't create any line in the output.

    The desired output file is either

    ID1 ID2 ID3 Start End
    A B D 202002 202003

    or

    ID1 ID2 ID3 Month
    A B D 202002
    A B D 202003

    Either one form is fine. Both are hard.

    I actually want to create any 4-firm or 5-firm or any length of longer chains, but creating 3-firm chain is already hard enough, so I would like begin with creating 3-firm chain.

    This seems to be a hard problem on Stata or on any language.

    One idea I tried to pursue was to create a data like this (I don't know how to):

    Giver Receiver Month
    A B 201807
    A B 201808
    A B 202001
    A B 202002
    A B 202003
    B C 201902
    B C 201903
    B D 202002
    B D 202003
    B D 202004

    and from this I construct my goal, but again I can't figure out how to do this, either. Just an idea.

    What command or trick will likely to help me?


    Code:
    gen Giver=""
    gen Receiver=""
    gen Start=""
    gen End=""
    set obs 4
    replace Giver="A" if _n==1 | _n==2
    replace Receiver="B" if _n==1 | _n==2
    replace Giver="A" if _n==3 | _n==4
    replace Receiver="C" if _n==3
    replace Receiver="D" if _n==4
    replace Start="201807" if _n==1
    replace Start="202001" if _n==2
    replace Start="201902" if _n==3
    replace Start="202002" if _n==4
    replace End="201808" if _n==1
    replace End="202003" if _n==2
    replace End="201903" if _n==3
    replace End="202004" if _n==4

  • #2
    Well, at least for the length 3 chains it isn't really very hard.

    First, although the code would actually work without doing this, it is better when working with dates in Stata to use real Stata internal format date variables, not numbers that read as dates to the human eye but would potentially be mishandled by Stata in calculations. (In this case, these non-Stata-ish date variables sort correctly, so this particular code, which depends only on the order of the dates, would work anyway.)

    Note: "The chain from A to B to D has an overlapped time from 202003 to 202112" is not correct, and contradicts the correct results you later say you desire:
    ID1 ID2 ID3 Start End
    A B D 202002 202003
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str1(giver receiver) long(start end)
    "A" "B" 201807 201808
    "A" "B" 202001 202003
    "B" "C" 201902 201903
    "B" "D" 202002 202004
    end
    
    //    CONVERT START AND END DATES TO STATA INTERNAL FORMAT MONTHLY DATES
    foreach v of varlist start end {
        gen _`v' = ym(floor(`v'/100), mod(`v', 100))
        assert missing(`v') == missing(_`v')
        format _`v' %tm
        drop `v'
        rename _`v' `v'
    }
    
    //    PAIR EACH OBSERVATION WITH A POTENTIAL FOLLOW-ON IN THE CHAIN:
    //  I.E. RECEIVER1 = GIVER2
    preserve
    rename receiver receiver2
    rename giver receiver
    rename (start end) =2
    tempfile to_join
    save `to_join'
    restore
    
    joinby receiver using `to_join'
    rename (start end) =1
    
    //  NOW RESTRICT TO THOSE WHERE THE PERIODS IN QUESTION OVERLAP
    keep if max(start1, start2) < min(end1, end2)
    gen start = max(start1, start2)
    gen end = min(end1, end2)
    format start end %tm
    drop start1 start2 end1 end2
    In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.



    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      Well, at least for the length 3 chains it isn't really very hard.

      First, although the code would actually work without doing this, it is better when working with dates in Stata to use real Stata internal format date variables, not numbers that read as dates to the human eye but would potentially be mishandled by Stata in calculations. (In this case, these non-Stata-ish date variables sort correctly, so this particular code, which depends only on the order of the dates, would work anyway.)

      Note: "The chain from A to B to D has an overlapped time from 202003 to 202112" is not correct, and contradicts the correct results you later say you desire:


      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str1(giver receiver) long(start end)
      "A" "B" 201807 201808
      "A" "B" 202001 202003
      "B" "C" 201902 201903
      "B" "D" 202002 202004
      end
      
      // CONVERT START AND END DATES TO STATA INTERNAL FORMAT MONTHLY DATES
      foreach v of varlist start end {
      gen _`v' = ym(floor(`v'/100), mod(`v', 100))
      assert missing(`v') == missing(_`v')
      format _`v' %tm
      drop `v'
      rename _`v' `v'
      }
      
      // PAIR EACH OBSERVATION WITH A POTENTIAL FOLLOW-ON IN THE CHAIN:
      // I.E. RECEIVER1 = GIVER2
      preserve
      rename receiver receiver2
      rename giver receiver
      rename (start end) =2
      tempfile to_join
      save `to_join'
      restore
      
      joinby receiver using `to_join'
      rename (start end) =1
      
      // NOW RESTRICT TO THOSE WHERE THE PERIODS IN QUESTION OVERLAP
      keep if max(start1, start2) < min(end1, end2)
      gen start = max(start1, start2)
      gen end = min(end1, end2)
      format start end %tm
      drop start1 start2 end1 end2
      In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

      When asking for help with code, always show example data. When showing example data, always use -dataex-.



      This is a great solution. Thank you. I could also extend it to create maximum possible chain length.

      This code works for small data.

      But it appears that joinby takes a huge memory on a large enough data.

      Stata says

      Code:
      attempt to use too much memory
          You just attempted to use more memory than this Stata can currently accommodate.
          The maximum is 1600g bytes.  That maximum is probably more than your computer can
          provide.  You can, however, increase the maximum by increasing segmentsize, which
          is currently 32m; see help memory.  Double segment size, and you double the
          maximum.

      Perhaps dividing the tempfile into 10 chunks will do the trick?
      Last edited by James Park; 06 Nov 2023, 21:02.

      Comment


      • #4
        This code works for small data.

        But it appears that joinby takes a huge memory on a large enough data.
        That is correct. -joinby- pairs up every observation in `to_join' with every observation in the original data set that has the same value of variable receiver. So, yes this can easily break through the limits of memory. But this is a problem with the intermediate result only. The end-result, which has weeded out those pairs where there is no overlap in the intervals, will be smaller, possibly much smaller.

        There is a possible workaround for this. For any given receiver, the observations involving other receivers are partly irrelevant. So, you can run this code for one receiver (from the original data) at a time, pair them up with observations having the same receiver (which was giver in the original data set) in `to_join', save those results, and then append the results all together. This approach avoids creating a gargantuan intermediate data set. Robert Picard and I wrote -runby- several years ago, which automates this, and also leverages some efficiencies that come from the use of Mata. -runby- is available from SSC. The code would look like this (changes in bold face):

        Code:
        //    CONVERT START AND END DATES TO STATA INTERNAL FORMAT MONTHLY DATES
        foreach v of varlist start end {
            gen _`v' = ym(floor(`v'/100), mod(`v', 100))
            assert missing(`v') == missing(_`v')
            format _`v' %tm
            drop `v'
            rename _`v' `v'
        }
        
        //    PAIR EACH OBSERVATION WITH A POTENTIAL FOLLOW-ON IN THE CHAIN:
        //  I.E. RECEIVER1 = GIVER2
        preserve
        rename receiver receiver2
        rename giver receiver
        rename (start end) =2
        tempfile to_join
        sort receiver
        save `to_join'
        restore
        gen join_file = "`to_join'"
        
        capture program drop one_receiver
        program define one_receiver
            local to_join = join_file[1]
            joinby receiver using `to_join'
            rename (start end) =1
        
            //  NOW RESTRICT TO THOSE WHERE THE PERIODS IN QUESTION OVERLAP
            keep if max(start1, start2) < min(end1, end2)
            gen start = max(start1, start2)
            gen end = min(end1, end2)
            format start end %tm
            drop start1 start2 end1 end2 join_file
            exit
        end
        
        runby one_receiver, by(receiver) status
        Since your file is large enough to blow through memory limits, I imagine that the original code would also take a very long time to run. -runby- will speed things up a bit, and it will also (because of the -status- option) provide an updated progress report showing how many receivers have been processed so far and an estimate of the remaining time.

        Now, there is no guarantee that this will work. It depends on the number of receivers and how many observations each has. If, as a worst case scenario, your data set was many millions of observations but all had the same receiver, the memory situation would be no different from before. But if you have a large number of different receivers, each with moderate numbers of observations, this will work just fine. It's at least worth a try.
        Last edited by Clyde Schechter; 06 Nov 2023, 21:27.

        Comment


        • #5
          Thank you so much. I actually found a problem in my attempt at extending it to more than 3 chains (4,5,6... chains). In my initial attempt, when doing joinby, the master data was the combined 3-chain and 2-chains, so that blew up the size of data. But then after making the master data to be just 3-chains and using data to be 2-chains, then it works fine.

          Comment

          Working...
          X