Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating all possible pairs within a single variable

    Hi all -

    Suppose I have data that looks like this:
    group identifier var1 var2
    A 1
    B 0
    C 1
    C 1
    D 0
    The values of Var1, Var2 are unimportant.

    I want to regress Var1 on Var2 among a sample only restricted to certain pairs within the group variable. More specifically, I want each group for which identifier == 0 to be paired with all groups for which identifier == 1, one at a time. Here, B would be paired with A and C, and D would be paired with A and C as well. A and C would not be a pair and neither would B and D.

    There are 32 groups in the actual data, about half of which have identifier == 0, though, so writing

    reg var1 var2 if inlist(group, A, B)
    reg var1 var2 if inlist(group, B, C)

    for all permutations is horribly inefficient. Is there a way to assign a numeric value (call it id) to each pair, so that instead I can write something to the effect of

    forvalues i = 1/n {
    qui reg var1 var2 if id == `i'
    est sto var1_`i'
    }

    Most methods I've been able to find only work if the pairs are formed from observations in two separate variables, whereas I want to create groups within a variable.

    Happy to clarify anything if my question is confusing.

  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str2 group byte(identifier var1 var2)
    "A " 1 . .
    "B " 0 . .
    "C " 1 . .
    "C " 1 . .
    "D " 0 . .
    end
    
    preserve
    keep if identifier == 0
    rename group group0
    tempfile zero
    save `zero'
    
    restore
    keep if identifier == 1
    rename group group1
    cross using `zero'
    order group0, after(group1)
    
    egen id = group(group0 group1)
    In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Comment


    • #3
      Thanks Clyde! And will do, re: using -dataex-.

      I do have a problem, though. When I ran your code, I got an error message that says "sum of expand values exceed 2,147,483,620. The dataset may not contain more than 2,147,483,620 observations."

      I'm no expert, but that seems like a few too many observations. Any idea how to rectify this? for context, my dataset contains ~500k observations at present.

      Comment


      • #4
        Probably not! Whether this is feasible depends on the distribution of the zeros and ones in the data. If there are about 250,000 of each, then to pair up each 0 with each 1 requires creating a data set of size 250,000*250,000, which is 6.25*1010 observations--and that is, indeed, larger than the maximum allowed. There is no way around this. If it's a bit more imbalanced, then the total required would be smaller. The best case scenario would be 499,999 zeroes and 1 one (or the other way around), which would require only the original size of the data set. But you need a very lopsided zero-one distribution to keep the number small.

        In any event, unless you can shrink your data set (by a lot), there is no way to code around this. The problem is not with some intermediate data set being created that will later be pruned down to size. The problem is that the end result you are asking for inherently is too large to be contained in a Stata data set.

        Comment


        • #5
          Well, not the answer I was hoping for, admittedly...

          I think I found a workaround, with some slight issues.

          The data is a survey, and respondents can come from one of fifty states (fewer than fifty are ultimately in the sample). So my thinking was, collapse by state (formerly 'group') and identify the pairs that way. Var1 and Var2 in this case are coordinates for the capitals of the states, so collapsing and taking their mean actually has no effect on their values.

          I modified your code and wrote:

          Code:
          preserve
          
          collapse var1 var2 identifier, by(group)  
          keep if identifier == 0
          rename group group0
          tempfile zero
          save `zero'  
          
          restore
          keep if identifier == 1
          rename group group1
          cross using `zero'
          
          order group0, after(group1)
           
          egen id = group(group0 group1)
          The only issue here is that I only get var1 and var2 for one of the states (groups) when I need them for both. Do you know of a way to fix this? Normally I'd work it out on my own but I can't seem to find much support for the cross command.
          Last edited by Mark Weiss; 02 Mar 2023, 08:53.

          Comment


          • #6
            This is a quick fix, as it turns out - just rename var1 and var2 as well. They had a common name before and merged when the two datasets were crossed.

            Thanks Clyde! Appreciate the help.

            Comment


            • #7
              There is another issue. You should do the -collapse- before the -preserve-. I think the code should be:
              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str2 group byte(identifier var1 var2)
              "A " 1 . .
              "B " 0 . .
              "C " 1 . .
              "C " 1 . .
              "D " 0 . .
              end
              
              collapse var1 var2 identifier, by(group)  
              preserve
              
              keep if identifier == 0
              rename group group0
              rename (var1 var2) =_0
              tempfile zero
              save `zero'  
              
              restore
              keep if identifier == 1
              rename group group1
              rename (var1 var2) =_1
              cross using `zero'
              
              order group0, after(group1)
               
              egen id = group(group0 group1)
              Added: You originally said you wanted to regress var1 on var2. Now you say that these variables are the longitude and latitude of the state capital. What is the purpose of regressing the longitude on the latitude of the state capital? I'm curious what the research goal is.
              Last edited by Clyde Schechter; 02 Mar 2023, 10:20.

              Comment


              • #8
                Var1 and Var2 were placeholders that, throughout the course of discussing the problem, changed. At first, they were stand-ins for other variables, but then I needed to preserve the lat and long and so I changed what they represented to see if we could solve the problem.

                Apologies, this was obviously bad form - but I didn't think it would cause any issues with the code itself!

                Comment


                • #9
                  No need to apologize. You are quite write that it causes no issues with the code. The only reason I raised the question in #7 is just curiosity.

                  Comment

                  Working...
                  X