Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Loop to avoid using more variables than Stata allows

    Hi everyone,

    I am working with a lot of data and could really use your help. I am trying to create a pairwise measure for some variables for which I create a matrix out of two categorical variables (one of which I then turn into dummy variables for inventors, so a 1 is indicating that assignee i has worked that inventor). However, I can't simply create dummies for all the categories of the inventor variable, because those would be more variables than stata would allow me to use. Is there a way for me to write a loop that always takes two assignees at a time and then creates my similarity measure for these and in the end adds everything together?

    Given my data example, below is the code that I have been working with on smaller samples. What I want to do from here is have the code loop through a few pairs of assignees at a time, somehow store the results (in pairwise format) and merge them, so that in the end I will get a data set which has three variables | assignee_id_1 assignee_id_2 total_inv | where the last variable is the measure denoting something along the lines of the total shared inventors. Note that the first part of the code (under **manual dummy creator) won't work in my scenario because there are two many variables being created and would have to be incorporated into the loop:

    I have also added a data example at the very bottom. Please do let me know should anything be unclear and I will try to resolve any confusion as quickly as possible. I would really appreciate any help I could get.

    Thanks,
    John

    Code:
    *********************************
    encode inventor_id, gen(enc_inv_id)
    **manual dummy creator
    levelsof enc_inv_id, local(n)
    foreach j in `n'{
    gen d_`j'= `j' == enc_inv_id
    }
    
    ***create matrix
    collapse(max) d_*, by(assignee_id)
    
    
    
    ******
    ******Jaccard
    ******
    ******
    ******Jaccard
    ******
    preserve
    rename * *_inv
    tempfile id2_inv
    save `id2_inv'
    restore
    cross using `id2_inv'
    drop if assignee_id== assignee_id_inv
    drop if assignee_id< assignee_id_inv
    
    *number is 30325 for 500
    forval i=1/30235{
    capture gen tinv_`i'= cond((d_`i'+d_`i'_inv) > 1, 1, (d_`i'+d_`i'_inv))
    }
    
    egen total_inv= rowtotal(tinv_*)

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str11 inventor_id str36 assignee_id
    "10082632-2" "org_LzsrUkjiWAIjd5v2paoI"
    "7107498-1"  "org_kSu9KMJQP5Qjd1ksQEfJ"
    "7298882-3"  "org_Re0fk3jNFREw0dcTKfbh"
    "5493647-7"  "org_vhtWu21o8nVXJPQm82Rg"
    "10050270-2" "org_vhtWu21o8nVXJPQm82Rg"
    "10072529-1" "org_vhtWu21o8nVXJPQm82Rg"
    "10085369-1" "org_0b1dsqimqIilzoArXt5E"
    "10085369-1" "org_0b1dsqimqIilzoArXt5E"
    "10079282-2" "org_vhtWu21o8nVXJPQm82Rg"
    "10066281-3" "org_D6QsTT7TAi7VCJKGWsch"
    "7797813-4"  "org_KkXCLSsnWnBXYx4AkNuK"
    "10038545-4" "org_ISzfczIbBwtvODSmyDh0"
    "6020252-1"  "org_SIFMk7wufuE8EAAX00co"
    "7397516-1"  "org_0b1dsqimqIilzoArXt5E"
    "10044878-2" "org_Z5OO886eRu5q4OFbfJki"
    "6041748-2"  "org_1zuIlrebuviyK6v0YrBW"
    "5973456-2"  "org_0b1dsqimqIilzoArXt5E"
    "8132276-1"  "org_Re0fk3jNFREw0dcTKfbh"
    "9970057-2"  "org_MXADrSM40Tb8jm6v4hYe"
    "6583036-1"  "org_hke26dncXOrqlNsMhgUL"
    "10000438-1" "org_J6ZIJIkyzJ4mfvUJ0EPZ"
    "8550478-4"  "org_UVkz6sbdutjKFh5akYfA"
    "6615990-3"  "org_1zuIlrebuviyK6v0YrBW"
    "7594444-5"  "org_69z9bh3CjxKHL2yOCxil"
    "5287700-2"  "org_wHd57MIpQIC17zih5MDg"
    "5140178-1"  "org_vhtWu21o8nVXJPQm82Rg"
    "5432481-1"  "org_vhtWu21o8nVXJPQm82Rg"
    "10014593-1" "org_wHd57MIpQIC17zih5MDg"
    "7324670-1"  "org_vhtWu21o8nVXJPQm82Rg"
    "10040763-3" "org_OeOCSMKvIyMnrOBiLUWE"
    "10040763-3" "org_OeOCSMKvIyMnrOBiLUWE"
    "7805565-3"  "org_STc1KHT2TAIOvxitTruU"
    "9878450-7"  "org_wHd57MIpQIC17zih5MDg"
    "5281520-3"  "org_l2aE7VeZKBQmMw8FsMro"
    "5281520-3"  "org_l2aE7VeZKBQmMw8FsMro"
    "10135172-1" "org_LzuWdBdhsJGSOccRyR7c"
    "3930622-1"  "org_vhtWu21o8nVXJPQm82Rg"
    "9112343-6"  "org_wHd57MIpQIC17zih5MDg"
    "10002621-1" "org_ISzfczIbBwtvODSmyDh0"
    "10002621-1" "org_ISzfczIbBwtvODSmyDh0"
    "10002621-1" "org_ISzfczIbBwtvODSmyDh0"
    "8951673-1"  "org_hwwhVrm6DmcUHmptTCv2"
    "8951673-1"  "org_hwwhVrm6DmcUHmptTCv2"
    "5746724-3"  "org_gzQbESqZ6fQ4qpVglKBf"
    "5632663-1"  "org_StddogfTkajU6bPvEq59"
    "10025048-1" "org_hke26dncXOrqlNsMhgUL"
    "10076373-1" "org_bl4IwhdXgo8T0BvN5rgl"
    "10012962-4" "org_67H6kwB9En0b4zX9LlgF"
    "7312330-5"  "org_7Ii76o0QYwYxPzsgHLSD"
    "10100121-3" "org_DyNSIsS5orV1ZMcyOuCN"
    end

  • #2
    How many different assignees does your actual data have? That would affect what is a reasonable solution.

    Comment


    • #3
      Unfortunately, the example data you show does not contain any instances of a pair of assignees having any inventors in common. To test my code I had to modify your data. I hope my test will cover the full generality of possibilities in your data.

      Code:
      //    CREATE A DATASET OF ASSIGNEE PAIRS
      tempfile original
      save `original'
      keep assignee
      tempfile copy
      save `copy'
      rename assignee_id assignee1
      cross using `copy'
      rename assignee_id assignee2
      tempfile pairs
      keep if assignee1 > assignee2
      duplicates drop
      save `pairs'
      
      use `original', clear
      duplicates drop
      sort assignee_id inventor_id
      tempfile collaborations
      save `collaborations'
      
      capture program drop one_pair
      program define one_pair
          local collaborations = collaborations[1]
          gen assignee_id = assignee1
          merge 1:m assignee_id using `collaborations', keep(match) nogenerate
          replace assignee_id = assignee2
          merge 1:m assignee_id inventor_id using `collaborations', keep(match master)
          count if _merge == 3
          gen total_inv = r(N)
          keep in 1
          keep assignee1 assignee2 total_inv
          exit
      end
      
      
      use `pairs', clear
      gen collaborations = "`collaborations'" // CAN'T PASS MACRO INTO PROGRAM, STORE IT HERE
      runby one_pair, by(assignee1 assignee2) status
      -runby- is written by Robert Picard and me, and is available from SSC.

      Added: In editing, I added the boldfaced -sort- command; it is not necessary to get correct results but it decreases the execution time required enormously. By sorting the `collaborations' file once and for all, you avoid having to sort it (twice!) every time program one_pair is called by -runby-. Even on a small example such as the one posted, the time savings is huge.
      Last edited by Clyde Schechter; 20 Mar 2019, 19:04.

      Comment


      • #4
        Mike Lacy I am currently working with a sample of 2000 assignees, but am hoping to scale that number up higher later in the process. I have data on around 50k.

        Clyde Schechter thank you so much for the code. I was not aware of the runby command, but it seems to do exactly what I am looking for. Your help is highly appreciated. If i might ask a follow up question: I am just getting started working with programming function, so I am unsure how to add an extra part into your code. I was hoping to just be able to add that myself after having gotten started, but since you use a slightly different syntax (using the r(N) function, which seems smarter), I have found myself getting stuck again. Apologies for having to ask twice. After having assigned the total_inv measure, I want to use it as the denominator in a ratio measuring shared inventors over all inventors. I am, now, struggling to recreate the code for the shared inventors. To illustrate, I have added my previous code below (at least the one starting from the total_inv measure). Again, I am now wondering how recreate the loop and egen command that subsequently create the variable I had called "count_inv".

        Again, thanks so much for your help!

        Code:
        *This part you had already illustrated nicely before.
        forval i=1/16335{
        capture gen tinv_`i'= cond((d_`i'+d_`i'_inv) > 1, 1, (d_`i'+d_`i'_inv))
        }
        
        egen total_inv= rowtotal(tinv_*)
        
        
        **this is the part I am struggling to add
        forval i=1/16335{
        capture replace tinv_`i'= cond((d_`i'+d_`i'_inv) == 2, 1, 0)
        }
        
        egen count_inv= rowtotal(tinv_*)
        
        gen similarity_inv= (count_inv/total_inv)*100

        Comment


        • #5
          First, let's be clear: the variable calculated by my code in #3 is the number of inventors shared by both members of the pair of assignees. I perhaps erred in calling it tot_inv. If you want another variable that really is the total number of inventors associated to either assignee, we have to modify the code a bit.

          Code:
          //    CREATE A DATASET OF ASSIGNEE PAIRS
          tempfile original
          save `original'
          keep assignee
          tempfile copy
          save `copy'
          rename assignee_id assignee1
          cross using `copy'
          rename assignee_id assignee2
          tempfile pairs
          keep if assignee1 > assignee2
          duplicates drop
          save `pairs'
          
          use `original', clear
          duplicates drop
          sort assignee_id inventor_id
          tempfile collaborations
          save `collaborations'
          
          capture program drop one_pair
          program define one_pair
              local collaborations = collaborations[1]
              assert _N == 1
              expand 2
              replace assignee1 = assignee2 in 2
              drop assignee2
              rename assignee1 assignee_id
              merge 1:m assignee_id using `collaborations', keep(match)
              by inventor_id (assignee_id), sort: gen status = _N if _n == 1
              assert inlist(status, 1, 2) | missing(status)
              count if status == 2
              local shared_inv = r(N)
              count if !missing(status)
              local total_inv = r(N) 
              by assignee_id, sort: keep if _n == 1
              keep assignee_id
              gen assignee2 = assignee_id[2]
              rename assignee_id assignee1
              keep in 1
              gen total_inv = `total_inv'
              gen shared_inv = `shared_inv'
              exit
          end
          
          
          use `pairs', clear
          gen collaborations = "`collaborations'" // CAN'T PASS MACRO INTO PROGRAM, STORE IT HERE
          runby one_pair, by(assignee1 assignee2) status status
          All of the changes to the code are found inside program one_pair. It now calculates both the total number of inventors that worked with either assignee (but does not double count any who worked with both) -- total_inv, and the number of inventors that the pair of assignees have in common -- shared_inv.

          Comment


          • #6
            Thank you so much Clyde Schechter, that is exactly what I meant. My apologies if that was unclear.

            Trying to use this code with some of the larger samples of my data, I have run into some new problems, however. When I reach the 'cross' command, I repeatedly get the error message that the "sum of expand values exceed 2,147,483,620. The dataset may not contain more than 2,147,483,620 observations". I think I understand where the message is coming from, but am wondering how to work my way around it, as I am trying to find a way in which only part of the data is used at a time. I do have access to an MP version of Stata as well, but with some of the other parts of my data, I would expect the cross command to exceed the 20 billion threshold that MP has as well.

            Comment


            • #7
              Well, the first thing I would try, which may or may not be sufficient to solve the problem, is to modify the code that creates the pairs file as follows:

              Code:
              //    CREATE A DATASET OF ASSIGNEE PAIRS
              tempfile original
              save `original'
              keep assignee
              duplicates drop
              tempfile copy
              save `copy'
              rename assignee_id assignee1
              cross using `copy'
              rename assignee_id assignee2
              tempfile pairs
              keep if assignee1 > assignee2
              duplicates drop
              save `pairs'
              Many assignees appear more than once in the original data. In your example data, the average number of appearances per assignee is 1.8. Since the size of the crossed data is the square of the size of the original, this roughly 56% reduction in the data set of assignees will reduce the crossed data to about 1/3 of its current size.

              If that does not suffice, then it will be necessary to partition the data at an earlier stage and that is somewhat complicated. So try the above first and see if it suffices.

              Note, by the way, that the maximum number of observations in Stata MP is a little more than 1 trillion observations, so with the code as modified here you will only break the limit on the number of observations in the -cross- if the number of distinct assignees in your code is over 1 million. In fact, I would imagine that in MP you would bump up against the memory limits of the computer before that, if your computer is a typical business machine.

              The part of the code that prepairs the pairs file is the only one that uses the data as a whole. The rest of the code feeds a single pair of assignees at a time into memory and then brings in only those inventors that worked with them, never anything more. The large file of pairs is actually not in memory: it is held in a Mata matrix (the size of which is limited only by available memory).

              Comment


              • #8
                Perhaps I'm misunderstanding, but can't one use -joinby- here rather than -cross-? My understanding of this problem was that the relevant assignee1 X assignee2 pairs were those that shared an inventor. Wouldn't it therefore work to do (conceptually):
                Code:
                use assignee1
                joinby inventor using assignee2
                This produces a file of order N_inventor rather than (N_assignee)^2, although it might take a bigger file in process. And, it's not as fast as -runby-, but perhaps it would avoid the memory limits. This seems too simple, so my apologies if I'm distracting you all from the right approach.
                Last edited by Mike Lacy; 23 Mar 2019, 16:32.

                Comment


                • #9
                  Mike Lacy In the original post, that would have worked. But now we also have to count all inventors who worked for either assignee, not just the ones they had in common, and we have to do that for every possible pair of assignees.

                  Comment


                  • #10
                    Actually, I realized overnight that it isn't as complicated to avoid the large cross as I imagined. This works:

                    Code:
                    //    CREATE A DATASET OF ASSIGNEES
                    tempfile original
                    save `original'
                    keep assignee_id
                    rename assignee_id assignee
                    duplicates drop
                    tempfile assignees
                    save `assignees'
                    
                    use `original', clear
                    duplicates drop
                    sort assignee_id inventor_id
                    tempfile collaborations
                    save `collaborations'
                    
                    capture program drop one_pair
                    program define one_pair
                        local collaborations = collaborations[1]
                        assert _N == 1
                        expand 2
                        replace assignee1 = assignee2 in 2
                        drop assignee2
                        rename assignee1 assignee_id
                        merge 1:m assignee_id using `collaborations', keep(match)
                        by inventor_id (assignee_id), sort: gen status = _N if _n == 1
                        assert inlist(status, 1, 2) | missing(status)
                        count if status == 2
                        local shared_inv = r(N)
                        count if !missing(status)
                        local total_inv = r(N) 
                        by assignee_id, sort: keep if _n == 1
                        keep assignee_id
                        gen assignee2 = assignee_id[2]
                        rename assignee_id assignee1
                        keep in 1
                        gen total_inv = `total_inv'
                        gen shared_inv = `shared_inv'
                        exit
                    end
                    
                    capture program drop one_assignee
                    program define one_assignee
                        rename assignee assignee1
                        local assignees = assignees_file[1]
                        cross using `assignees'
                        rename assignee assignee2
                        keep if assignee1 > assignee2
                        if _N > 0 {
                            runby one_pair, by(assignee1 assignee2)
                        }
                        exit
                    end
                    
                    use `assignees', clear
                    gen assignees_file = "`assignees'"
                    gen collaborations = "`collaborations'" // CAN'T PASS MACRO INTO PROGRAM, STORE IT HERE
                    runby one_assignee, by(assignee) status
                    There is still a -cross- command involved, but when it is invoked, there is only a single observation in the active data set, so the "cross" is really just bringing in the full list of assignees and propagating the one assignee in focus at the moment into pairing up with all other assignees. In other words, this version of the code works by picking a single assignee at a time, and then pairing that one with every other, and then determining for each pair how many inventors they have worked with in total, and in common.

                    So if the modification I suggested in #7 did not work for you, use this version.

                    Comment


                    • #11
                      Thinking a little more deeply about this, it may be that even #10 won't save the day. While I'm quite sure the -cross- problem is solved in #10, because the -cross- there only creates a data set that is the size of your original input data set (in terms of # of observations), the final results data set will contain, per your problem specification, one observation for every distinct pair of assignees. If the final results data set is itself too large for Stata, then nothing you can do will give you the results you are looking for.

                      If you have k distinct assignees in your data set, that implies k*(k-1)/2 distinct pairs of assignees. Now, the -cross- in #7, creates a data set with k2 observations, which is only a little more than twice as big as the final answer data set. I don't know whether -cross- will just barely fail in #7 or if the -cross- result there is way more than twice as big as a Stata data set permits. But you can figure it out. Load your data set and run:
                      Code:
                      by assignee_id, sort: gen long counter = 1 if _n == 1
                      replace counter = sum(counter)
                      local k = counter[_N]
                      display "There are `k' distinct assignees in the data."
                      display "There are `=`k'*(`k'-1)/2' distinct assignee pairs in the data."
                      Compare the number of distinct assignee pairs with the maximum allowable number of observations (see -help limits- for the exact numbers) and you will know whether this is even possible.

                      If it's not possible, you may have to look for other software to do this. Or perhaps the problem itself can be "donwsized." For example, if you are really only interested in those assignee pairs that have some inventors in common, which, if your example is representative, is probably a very small fraction of all pairs, then it is easy enough to shrink the final answer data to just those pairs that do have at least one shared inventor by adding -drop if shared_inv == 0- just before the -exit- command in program one_pair. Or perhaps there is some sensible and manageable way to put some pairs in one answer data set and other pairs in others.

                      Comment

                      Working...
                      X