Creating all possible pairs within a single variable

Mark Weiss

Join Date: Sep 2022

Posts: 23
#1

Creating all possible pairs within a single variable

01 Mar 2023, 13:29

Hi all -

Suppose I have data that looks like this:

group identifier var1 var2

A 1

B 0

C 1

C 1

D 0

The values of Var1, Var2 are unimportant.

I want to regress Var1 on Var2 among a sample only restricted to certain pairs within the group variable. More specifically, I want each group for which identifier == 0 to be paired with all groups for which identifier == 1, one at a time. Here, B would be paired with A and C, and D would be paired with A and C as well. A and C would not be a pair and neither would B and D.

There are 32 groups in the actual data, about half of which have identifier == 0, though, so writing

reg var1 var2 if inlist(group, A, B)
reg var1 var2 if inlist(group, B, C)

for all permutations is horribly inefficient. Is there a way to assign a numeric value (call it id) to each pair, so that instead I can write something to the effect of

forvalues i = 1/n {
qui reg var1 var2 if id == `i'
est sto var1_`i'
}

Most methods I've been able to find only work if the pairs are formed from observations in two separate variables, whereas I want to create groups within a variable.

Happy to clarify anything if my question is confusing.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#2

01 Mar 2023, 13:49

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str2 group byte(identifier var1 var2) "A " 1 . . "B " 0 . . "C " 1 . . "C " 1 . . "D " 0 . . end preserve keep if identifier == 0 rename group group0 tempfile zero save `zero' restore keep if identifier == 1 rename group group1 cross using `zero' order group0, after(group1) egen id = group(group0 group1)

In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment
Mark Weiss

Join Date: Sep 2022

Posts: 23
#3

01 Mar 2023, 14:02

Thanks Clyde! And will do, re: using -dataex-.

I do have a problem, though. When I ran your code, I got an error message that says "sum of expand values exceed 2,147,483,620. The dataset may not contain more than 2,147,483,620 observations."

I'm no expert, but that seems like a few too many observations. Any idea how to rectify this? for context, my dataset contains ~500k observations at present.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#4

01 Mar 2023, 15:06

Probably not! Whether this is feasible depends on the distribution of the zeros and ones in the data. If there are about 250,000 of each, then to pair up each 0 with each 1 requires creating a data set of size 250,000*250,000, which is 6.25*10¹⁰ observations--and that is, indeed, larger than the maximum allowed. There is no way around this. If it's a bit more imbalanced, then the total required would be smaller. The best case scenario would be 499,999 zeroes and 1 one (or the other way around), which would require only the original size of the data set. But you need a very lopsided zero-one distribution to keep the number small.

In any event, unless you can shrink your data set (by a lot), there is no way to code around this. The problem is not with some intermediate data set being created that will later be pruned down to size. The problem is that the end result you are asking for inherently is too large to be contained in a Stata data set.
Comment
Mark Weiss

Join Date: Sep 2022

Posts: 23
#5

02 Mar 2023, 08:43

Well, not the answer I was hoping for, admittedly...

I think I found a workaround, with some slight issues.

The data is a survey, and respondents can come from one of fifty states (fewer than fifty are ultimately in the sample). So my thinking was, collapse by state (formerly 'group') and identify the pairs that way. Var1 and Var2 in this case are coordinates for the capitals of the states, so collapsing and taking their mean actually has no effect on their values.

I modified your code and wrote:

Code:

preserve collapse var1 var2 identifier, by(group) keep if identifier == 0 rename group group0 tempfile zero save `zero' restore keep if identifier == 1 rename group group1 cross using `zero' order group0, after(group1) egen id = group(group0 group1)

The only issue here is that I only get var1 and var2 for one of the states (groups) when I need them for both. Do you know of a way to fix this? Normally I'd work it out on my own but I can't seem to find much support for the cross command.

Last edited by Mark Weiss; 02 Mar 2023, 08:53.
Comment
Mark Weiss

Join Date: Sep 2022

Posts: 23
#6

02 Mar 2023, 09:04

This is a quick fix, as it turns out - just rename var1 and var2 as well. They had a common name before and merged when the two datasets were crossed.

Thanks Clyde! Appreciate the help.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#7

02 Mar 2023, 10:17

There is another issue. You should do the -collapse- before the -preserve-. I think the code should be:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str2 group byte(identifier var1 var2) "A " 1 . . "B " 0 . . "C " 1 . . "C " 1 . . "D " 0 . . end collapse var1 var2 identifier, by(group) preserve keep if identifier == 0 rename group group0 rename (var1 var2) =_0 tempfile zero save `zero' restore keep if identifier == 1 rename group group1 rename (var1 var2) =_1 cross using `zero' order group0, after(group1) egen id = group(group0 group1)

Added: You originally said you wanted to regress var1 on var2. Now you say that these variables are the longitude and latitude of the state capital. What is the purpose of regressing the longitude on the latitude of the state capital? I'm curious what the research goal is.

Last edited by Clyde Schechter; 02 Mar 2023, 10:20.
Comment
Mark Weiss

Join Date: Sep 2022

Posts: 23
#8

02 Mar 2023, 13:06

Var1 and Var2 were placeholders that, throughout the course of discussing the problem, changed. At first, they were stand-ins for other variables, but then I needed to preserve the lat and long and so I changed what they represented to see if we could solve the problem.

Apologies, this was obviously bad form - but I didn't think it would cause any issues with the code itself!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#9

02 Mar 2023, 13:59

No need to apologize. You are quite write that it causes no issues with the code. The only reason I raised the question in #7 is just curiosity.
Comment

group	identifier	var1	var2
A	1
B	0
C	1
C	1
D	0

Announcement

Creating all possible pairs within a single variable

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment