Loop to avoid using more variables than Stata allows

John Kirk

Join Date: Apr 2017
Posts: 57

Loop to avoid using more variables than Stata allows

20 Mar 2019, 16:49

Hi everyone,

I am working with a lot of data and could really use your help. I am trying to create a pairwise measure for some variables for which I create a matrix out of two categorical variables (one of which I then turn into dummy variables for inventors, so a 1 is indicating that assignee i has worked that inventor). However, I can't simply create dummies for all the categories of the inventor variable, because those would be more variables than stata would allow me to use. Is there a way for me to write a loop that always takes two assignees at a time and then creates my similarity measure for these and in the end adds everything together?

Given my data example, below is the code that I have been working with on smaller samples. What I want to do from here is have the code loop through a few pairs of assignees at a time, somehow store the results (in pairwise format) and merge them, so that in the end I will get a data set which has three variables | assignee_id_1 assignee_id_2 total_inv | where the last variable is the measure denoting something along the lines of the total shared inventors. Note that the first part of the code (under **manual dummy creator) won't work in my scenario because there are two many variables being created and would have to be incorporated into the loop:

I have also added a data example at the very bottom. Please do let me know should anything be unclear and I will try to resolve any confusion as quickly as possible. I would really appreciate any help I could get.

Thanks,
John

Code:

*********************************
encode inventor_id, gen(enc_inv_id)
**manual dummy creator
levelsof enc_inv_id, local(n)
foreach j in `n'{
gen d_`j'= `j' == enc_inv_id
}

***create matrix
collapse(max) d_*, by(assignee_id)



******
******Jaccard
******
******
******Jaccard
******
preserve
rename * *_inv
tempfile id2_inv
save `id2_inv'
restore
cross using `id2_inv'
drop if assignee_id== assignee_id_inv
drop if assignee_id< assignee_id_inv

*number is 30325 for 500
forval i=1/30235{
capture gen tinv_`i'= cond((d_`i'+d_`i'_inv) > 1, 1, (d_`i'+d_`i'_inv))
}

egen total_inv= rowtotal(tinv_*)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str11 inventor_id str36 assignee_id
"10082632-2" "org_LzsrUkjiWAIjd5v2paoI"
"7107498-1"  "org_kSu9KMJQP5Qjd1ksQEfJ"
"7298882-3"  "org_Re0fk3jNFREw0dcTKfbh"
"5493647-7"  "org_vhtWu21o8nVXJPQm82Rg"
"10050270-2" "org_vhtWu21o8nVXJPQm82Rg"
"10072529-1" "org_vhtWu21o8nVXJPQm82Rg"
"10085369-1" "org_0b1dsqimqIilzoArXt5E"
"10085369-1" "org_0b1dsqimqIilzoArXt5E"
"10079282-2" "org_vhtWu21o8nVXJPQm82Rg"
"10066281-3" "org_D6QsTT7TAi7VCJKGWsch"
"7797813-4"  "org_KkXCLSsnWnBXYx4AkNuK"
"10038545-4" "org_ISzfczIbBwtvODSmyDh0"
"6020252-1"  "org_SIFMk7wufuE8EAAX00co"
"7397516-1"  "org_0b1dsqimqIilzoArXt5E"
"10044878-2" "org_Z5OO886eRu5q4OFbfJki"
"6041748-2"  "org_1zuIlrebuviyK6v0YrBW"
"5973456-2"  "org_0b1dsqimqIilzoArXt5E"
"8132276-1"  "org_Re0fk3jNFREw0dcTKfbh"
"9970057-2"  "org_MXADrSM40Tb8jm6v4hYe"
"6583036-1"  "org_hke26dncXOrqlNsMhgUL"
"10000438-1" "org_J6ZIJIkyzJ4mfvUJ0EPZ"
"8550478-4"  "org_UVkz6sbdutjKFh5akYfA"
"6615990-3"  "org_1zuIlrebuviyK6v0YrBW"
"7594444-5"  "org_69z9bh3CjxKHL2yOCxil"
"5287700-2"  "org_wHd57MIpQIC17zih5MDg"
"5140178-1"  "org_vhtWu21o8nVXJPQm82Rg"
"5432481-1"  "org_vhtWu21o8nVXJPQm82Rg"
"10014593-1" "org_wHd57MIpQIC17zih5MDg"
"7324670-1"  "org_vhtWu21o8nVXJPQm82Rg"
"10040763-3" "org_OeOCSMKvIyMnrOBiLUWE"
"10040763-3" "org_OeOCSMKvIyMnrOBiLUWE"
"7805565-3"  "org_STc1KHT2TAIOvxitTruU"
"9878450-7"  "org_wHd57MIpQIC17zih5MDg"
"5281520-3"  "org_l2aE7VeZKBQmMw8FsMro"
"5281520-3"  "org_l2aE7VeZKBQmMw8FsMro"
"10135172-1" "org_LzuWdBdhsJGSOccRyR7c"
"3930622-1"  "org_vhtWu21o8nVXJPQm82Rg"
"9112343-6"  "org_wHd57MIpQIC17zih5MDg"
"10002621-1" "org_ISzfczIbBwtvODSmyDh0"
"10002621-1" "org_ISzfczIbBwtvODSmyDh0"
"10002621-1" "org_ISzfczIbBwtvODSmyDh0"
"8951673-1"  "org_hwwhVrm6DmcUHmptTCv2"
"8951673-1"  "org_hwwhVrm6DmcUHmptTCv2"
"5746724-3"  "org_gzQbESqZ6fQ4qpVglKBf"
"5632663-1"  "org_StddogfTkajU6bPvEq59"
"10025048-1" "org_hke26dncXOrqlNsMhgUL"
"10076373-1" "org_bl4IwhdXgo8T0BvN5rgl"
"10012962-4" "org_67H6kwB9En0b4zX9LlgF"
"7312330-5"  "org_7Ii76o0QYwYxPzsgHLSD"
"10100121-3" "org_DyNSIsS5orV1ZMcyOuCN"
end

Tags: None

Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

20 Mar 2019, 18:33

How many different assignees does your actual data have? That would affect what is a reasonable solution.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30103

20 Mar 2019, 18:53

Unfortunately, the example data you show does not contain any instances of a pair of assignees having any inventors in common. To test my code I had to modify your data. I hope my test will cover the full generality of possibilities in your data.

Code:

//    CREATE A DATASET OF ASSIGNEE PAIRS
tempfile original
save `original'
keep assignee
tempfile copy
save `copy'
rename assignee_id assignee1
cross using `copy'
rename assignee_id assignee2
tempfile pairs
keep if assignee1 > assignee2
duplicates drop
save `pairs'

use `original', clear
duplicates drop
sort assignee_id inventor_id
tempfile collaborations
save `collaborations'

capture program drop one_pair
program define one_pair
    local collaborations = collaborations[1]
    gen assignee_id = assignee1
    merge 1:m assignee_id using `collaborations', keep(match) nogenerate
    replace assignee_id = assignee2
    merge 1:m assignee_id inventor_id using `collaborations', keep(match master)
    count if _merge == 3
    gen total_inv = r(N)
    keep in 1
    keep assignee1 assignee2 total_inv
    exit
end


use `pairs', clear
gen collaborations = "`collaborations'" // CAN'T PASS MACRO INTO PROGRAM, STORE IT HERE
runby one_pair, by(assignee1 assignee2) status

-runby- is written by Robert Picard and me, and is available from SSC.

Added: In editing, I added the boldfaced -sort- command; it is not necessary to get correct results but it decreases the execution time required enormously. By sorting the `collaborations' file once and for all, you avoid having to sort it (twice!) every time program one_pair is called by -runby-. Even on a small example such as the one posted, the time savings is huge.

Last edited by Clyde Schechter; 20 Mar 2019, 19:04.

Comment

John Kirk

Join Date: Apr 2017

Posts: 57
#4

21 Mar 2019, 14:06

Mike Lacy I am currently working with a sample of 2000 assignees, but am hoping to scale that number up higher later in the process. I have data on around 50k.

Clyde Schechter thank you so much for the code. I was not aware of the runby command, but it seems to do exactly what I am looking for. Your help is highly appreciated. If i might ask a follow up question: I am just getting started working with programming function, so I am unsure how to add an extra part into your code. I was hoping to just be able to add that myself after having gotten started, but since you use a slightly different syntax (using the r(N) function, which seems smarter), I have found myself getting stuck again. Apologies for having to ask twice. After having assigned the total_inv measure, I want to use it as the denominator in a ratio measuring shared inventors over all inventors. I am, now, struggling to recreate the code for the shared inventors. To illustrate, I have added my previous code below (at least the one starting from the total_inv measure). Again, I am now wondering how recreate the loop and egen command that subsequently create the variable I had called "count_inv".

Again, thanks so much for your help!

Code:

*This part you had already illustrated nicely before. forval i=1/16335{ capture gen tinv_`i'= cond((d_`i'+d_`i'_inv) > 1, 1, (d_`i'+d_`i'_inv)) } egen total_inv= rowtotal(tinv_*) **this is the part I am struggling to add forval i=1/16335{ capture replace tinv_`i'= cond((d_`i'+d_`i'_inv) == 2, 1, 0) } egen count_inv= rowtotal(tinv_*) gen similarity_inv= (count_inv/total_inv)*100
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30103

21 Mar 2019, 16:16

First, let's be clear: the variable calculated by my code in #3 is the number of inventors shared by both members of the pair of assignees. I perhaps erred in calling it tot_inv. If you want another variable that really is the total number of inventors associated to either assignee, we have to modify the code a bit.

Code:

//    CREATE A DATASET OF ASSIGNEE PAIRS
tempfile original
save `original'
keep assignee
tempfile copy
save `copy'
rename assignee_id assignee1
cross using `copy'
rename assignee_id assignee2
tempfile pairs
keep if assignee1 > assignee2
duplicates drop
save `pairs'

use `original', clear
duplicates drop
sort assignee_id inventor_id
tempfile collaborations
save `collaborations'

capture program drop one_pair
program define one_pair
    local collaborations = collaborations[1]
    assert _N == 1
    expand 2
    replace assignee1 = assignee2 in 2
    drop assignee2
    rename assignee1 assignee_id
    merge 1:m assignee_id using `collaborations', keep(match)
    by inventor_id (assignee_id), sort: gen status = _N if _n == 1
    assert inlist(status, 1, 2) | missing(status)
    count if status == 2
    local shared_inv = r(N)
    count if !missing(status)
    local total_inv = r(N) 
    by assignee_id, sort: keep if _n == 1
    keep assignee_id
    gen assignee2 = assignee_id[2]
    rename assignee_id assignee1
    keep in 1
    gen total_inv = `total_inv'
    gen shared_inv = `shared_inv'
    exit
end


use `pairs', clear
gen collaborations = "`collaborations'" // CAN'T PASS MACRO INTO PROGRAM, STORE IT HERE
runby one_pair, by(assignee1 assignee2) status status

All of the changes to the code are found inside program one_pair. It now calculates both the total number of inventors that worked with either assignee (but does not double count any who worked with both) -- total_inv, and the number of inventors that the pair of assignees have in common -- shared_inv.

Comment

John Kirk

Join Date: Apr 2017

Posts: 57
#6

22 Mar 2019, 19:51

Thank you so much Clyde Schechter, that is exactly what I meant. My apologies if that was unclear.

Trying to use this code with some of the larger samples of my data, I have run into some new problems, however. When I reach the 'cross' command, I repeatedly get the error message that the "sum of expand values exceed 2,147,483,620. The dataset may not contain more than 2,147,483,620 observations". I think I understand where the message is coming from, but am wondering how to work my way around it, as I am trying to find a way in which only part of the data is used at a time. I do have access to an MP version of Stata as well, but with some of the other parts of my data, I would expect the cross command to exceed the 20 billion threshold that MP has as well.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30103
#7

23 Mar 2019, 14:46

Well, the first thing I would try, which may or may not be sufficient to solve the problem, is to modify the code that creates the pairs file as follows:

Code:

// CREATE A DATASET OF ASSIGNEE PAIRS tempfile original save `original' keep assignee duplicates drop tempfile copy save `copy' rename assignee_id assignee1 cross using `copy' rename assignee_id assignee2 tempfile pairs keep if assignee1 > assignee2 duplicates drop save `pairs'

Many assignees appear more than once in the original data. In your example data, the average number of appearances per assignee is 1.8. Since the size of the crossed data is the square of the size of the original, this roughly 56% reduction in the data set of assignees will reduce the crossed data to about 1/3 of its current size.

If that does not suffice, then it will be necessary to partition the data at an earlier stage and that is somewhat complicated. So try the above first and see if it suffices.

Note, by the way, that the maximum number of observations in Stata MP is a little more than 1 trillion observations, so with the code as modified here you will only break the limit on the number of observations in the -cross- if the number of distinct assignees in your code is over 1 million. In fact, I would imagine that in MP you would bump up against the memory limits of the computer before that, if your computer is a typical business machine.

The part of the code that prepairs the pairs file is the only one that uses the data as a whole. The rest of the code feeds a single pair of assignees at a time into memory and then brings in only those inventors that worked with them, never anything more. The large file of pairs is actually not in memory: it is held in a Mata matrix (the size of which is limited only by available memory).
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#8

23 Mar 2019, 16:29

Perhaps I'm misunderstanding, but can't one use -joinby- here rather than -cross-? My understanding of this problem was that the relevant assignee1 X assignee2 pairs were those that shared an inventor. Wouldn't it therefore work to do (conceptually):

Code:

use assignee1 joinby inventor using assignee2

This produces a file of order N_inventor rather than (N_assignee)^2, although it might take a bigger file in process. And, it's not as fast as -runby-, but perhaps it would avoid the memory limits. This seems too simple, so my apologies if I'm distracting you all from the right approach.

Last edited by Mike Lacy; 23 Mar 2019, 16:32.
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30103
#9

23 Mar 2019, 16:52

Mike Lacy In the original post, that would have worked. But now we also have to count all inventors who worked for either assignee, not just the ones they had in common, and we have to do that for every possible pair of assignees.
1 like
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30103

#10

24 Mar 2019, 12:52

Actually, I realized overnight that it isn't as complicated to avoid the large cross as I imagined. This works:

Code:

//    CREATE A DATASET OF ASSIGNEES
tempfile original
save `original'
keep assignee_id
rename assignee_id assignee
duplicates drop
tempfile assignees
save `assignees'

use `original', clear
duplicates drop
sort assignee_id inventor_id
tempfile collaborations
save `collaborations'

capture program drop one_pair
program define one_pair
    local collaborations = collaborations[1]
    assert _N == 1
    expand 2
    replace assignee1 = assignee2 in 2
    drop assignee2
    rename assignee1 assignee_id
    merge 1:m assignee_id using `collaborations', keep(match)
    by inventor_id (assignee_id), sort: gen status = _N if _n == 1
    assert inlist(status, 1, 2) | missing(status)
    count if status == 2
    local shared_inv = r(N)
    count if !missing(status)
    local total_inv = r(N) 
    by assignee_id, sort: keep if _n == 1
    keep assignee_id
    gen assignee2 = assignee_id[2]
    rename assignee_id assignee1
    keep in 1
    gen total_inv = `total_inv'
    gen shared_inv = `shared_inv'
    exit
end

capture program drop one_assignee
program define one_assignee
    rename assignee assignee1
    local assignees = assignees_file[1]
    cross using `assignees'
    rename assignee assignee2
    keep if assignee1 > assignee2
    if _N > 0 {
        runby one_pair, by(assignee1 assignee2)
    }
    exit
end

use `assignees', clear
gen assignees_file = "`assignees'"
gen collaborations = "`collaborations'" // CAN'T PASS MACRO INTO PROGRAM, STORE IT HERE
runby one_assignee, by(assignee) status

There is still a -cross- command involved, but when it is invoked, there is only a single observation in the active data set, so the "cross" is really just bringing in the full list of assignees and propagating the one assignee in focus at the moment into pairing up with all other assignees. In other words, this version of the code works by picking a single assignee at a time, and then pairing that one with every other, and then determining for each pair how many inventors they have worked with in total, and in common.

So if the modification I suggested in #7 did not work for you, use this version.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30103
#11

24 Mar 2019, 13:20

Thinking a little more deeply about this, it may be that even #10 won't save the day. While I'm quite sure the -cross- problem is solved in #10, because the -cross- there only creates a data set that is the size of your original input data set (in terms of # of observations), the final results data set will contain, per your problem specification, one observation for every distinct pair of assignees. If the final results data set is itself too large for Stata, then nothing you can do will give you the results you are looking for.

If you have k distinct assignees in your data set, that implies k*(k-1)/2 distinct pairs of assignees. Now, the -cross- in #7, creates a data set with k² observations, which is only a little more than twice as big as the final answer data set. I don't know whether -cross- will just barely fail in #7 or if the -cross- result there is way more than twice as big as a Stata data set permits. But you can figure it out. Load your data set and run:

Code:

by assignee_id, sort: gen long counter = 1 if _n == 1 replace counter = sum(counter) local k = counter[_N] display "There are `k' distinct assignees in the data." display "There are `=`k'*(`k'-1)/2' distinct assignee pairs in the data."

Compare the number of distinct assignee pairs with the maximum allowable number of observations (see -help limits- for the exact numbers) and you will know whether this is even possible.

If it's not possible, you may have to look for other software to do this. Or perhaps the problem itself can be "donwsized." For example, if you are really only interested in those assignee pairs that have some inventors in common, which, if your example is representative, is probably a very small fraction of all pairs, then it is easy enough to shrink the final answer data to just those pairs that do have at least one shared inventor by adding -drop if shared_inv == 0- just before the -exit- command in program one_pair. Or perhaps there is some sensible and manageable way to put some pairs in one answer data set and other pairs in others.
1 like
Comment

Announcement

Loop to avoid using more variables than Stata allows

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment