
  • Creating a square matrix using two relational variables

    Hi,

    I've spent 5-6 hours browsing the help files and past posts, but could not find the answer to what seems to be a very simple procedure. I have used UCINet and Pajek before and want to switch to Stata for working with matrices.

    I have two relational variables: students and classes that they are enrolled in. I want to create a square class-by-class matrix that would show the number of students in common in each class pair. This is very similar to the classic Southern Women study, albeit with a much larger data set. The initial two-variable dataset looks like this:

    class student
    1 1
    1 2
    1 3
    1 4
    1 5
    2 2
    2 3
    2 5
    3 1
    3 5
    4 2
    4 4
    4 5

    I want to end up with (in the above example) a 4x4 class matrix with zeros on the diagonal. I would appreciate any suggestions for Stata, Mata, or nwcommands.

    Cheers,

    Mikhail
    Last edited by Mikhail Balaev; 15 Jan 2020, 22:21.

  • #2
    Well, I don't know how well this solution will scale to a very large data set given that it uses -cross- and -reshape-. But it gives the information required:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(class student)
    1 1
    1 2
    1 3
    1 4
    1 5
    2 2
    2 3
    2 5
    3 1
    3 5
    4 2
    4 4
    4 5
    end
    
    fillin class student
    replace _fillin = !_fillin
    rename _fillin has_student
    
    reshape wide has_student, i(class) j(student)
    
    tempfile copy
    save `copy'
    rename _all =A
    cross using `copy'
    ds *A, not
    rename (`r(varlist)') =B
    
    gen long obs_no = _n
    reshape long has_student@A has_student@B, i(obs_no) j(student)
    gen byte common = has_studentA & has_studentB
    
    collapse (sum) common, by(classA classB)
    If you really want this as a 4x4 matrix, you can get that with -reshape wide- and mkmat. But unless you will actually be doing matrix algebra with these results, you are probably better off leaving it as is--this is the layout of the data that will be easiest to work with for most purposes.
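    A minimal sketch of that last step, assuming the data in memory are exactly the output of the -collapse- above (one observation per classA/classB pair, with the count in common). The matrix name C is a placeholder, and the zeroed diagonal follows the request in #1:

    ```stata
    * Sketch only: run after the -collapse- above.
    replace common = 0 if classA == classB      // no ties of a class to itself
    reshape wide common, i(classA) j(classB)    // one column per classB value
    mkmat common*, matrix(C)                    // C is a hypothetical matrix name
    matrix list C
    ```

    This inherits the scaling limits mentioned above, since the wide layout needs one variable per class.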

    In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.




    • #3
      Thank you, Clyde, but my dataset has 116,000 student-class observations, so reshaping it wide will exceed the Stata SE limit on the number of variables. I also do need a square matrix with a zero diagonal for further network analysis, centrality measures, and graphs.



      • #4
        I take it that you've already googled something like stata social network analysis and there's nothing out there that's been written to do this.



        • #5
          I wonder if the discussion in this thread may be useful: https://www.statalist.org/forums/for...ise-agreements



          • #6
            Hi Joseph, sure, and surprisingly I did not find how to convert two variables into a relational network matrix. The only solution I found was
            tab exam1 student_id, matcell(ex_st)
            matrix exam=ex_st*ex_st'

            However, the diagonal of this matrix contains counts instead of zeros (UCINet and Pajek have options for setting the diagonal to zero to eliminate the nodes' ties to themselves). The main issue is that, due to the size of the dataset, Stata cannot execute the -tab- command.
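            For the diagonal alone, a minimal sketch, assuming the matrix exam created by the two lines above already exists. It subtracts a diagonal matrix built from exam's own diagonal, using the official matrix functions diag() and vecdiag():

            ```stata
            * Sketch only: assumes the Stata matrix exam from the lines above.
            matrix exam = exam - diag(vecdiag(exam))   // set self-ties to zero
            matrix list exam
            ```

            This does nothing about the size limit of -tab-; it only handles the self-ties once a matrix has been obtained.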



            • #7
              Hi John, thank you for the link. They already begin with a matrix A(i,j). I want to get that matrix. If I could get it, I would just multiply it by its transpose and get what I need. So, I'm still trying to figure out how to get a matrix A(i,j) from two relational variables.
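              One possible way to build such a matrix, offered as a sketch rather than a tested solution for the full dataset: number the classes and students consecutively with -egen, group()-, fill a class-by-student incidence matrix A in Mata, and multiply it by its transpose. The variable names class and student follow the example in #1; the names i, j, idx, A, and C are placeholders. The sketch assumes no missing values and that A (classes x students x 8 bytes) fits in memory.

              ```stata
              * Sketch only: assumes numeric variables class and student in memory.
              egen long i = group(class)      // consecutive class index 1..n_classes
              egen long j = group(student)    // consecutive student index 1..n_students

              mata:
              idx = st_data(., ("i", "j"))                   // one row per enrolment
              A = J(max(idx[., 1]), max(idx[., 2]), 0)       // incidence matrix, all zeros
              for (k = 1; k <= rows(idx); k++) A[idx[k, 1], idx[k, 2]] = 1
              C = A * A'                                     // students shared by each class pair
              _diag(C, 0)                                    // zero out self-ties
              C
              end
              ```

              For the 13-observation example in #1 this should give 3 shared students for classes 1 and 2, 2 for classes 1 and 3, and 1 for classes 3 and 4. Rows and columns of C follow the sorted order of the class codes.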



              • #8
                I'm an admirer but not exactly a user of -nwcommands- (-search nwcommands-). Its -nw2fromedge- command, while making a network, will create a Mata matrix in which element i,j contains the number of students that class i and class j have in common. Is this correct for you?

                Code:
                clear
                nwclear
                mata mata clear
                input class student
                1 1
                1 2
                1 3
                1 4
                1 5
                2 2
                2 3
                2 5
                3 1
                3 5
                4 2
                4 4
                4 5
                end
                list
                //
                //
                // The following command comes from -nwcommands-. Among other things, it creates a Mata matrix
                // named nw_mata1, which I believe has what you need.
                nw2fromedge class student, project(1)
                mata: nw_mata1
                I think, however, that -nwcommands- calculates its centrality measures using its own network names. If you want one of its built-in network centrality measures, I'm not sure that you will need to manipulate the nw_mata1 matrix yourself. You might need some different options on the nw2fromedge command in order to get the network that you would like nwcommands to work on.



                • #9
                  Thanks Mike. nw2fromedge would be what I need, however Stata crashes with the system error that it runs out of memory while trying to generate a square matrix with my dataset of 116,000 student-class observations. Memory, number of variables, and matrix sizes are all set to maximum, Stata SE.
                  Last edited by Mikhail Balaev; 17 Jan 2020, 11:58.



                  • #10
                    As for other do-it-yourself approaches, it would be relevant to know how many students and how many classes there are. Depending on those numbers, I think it's possible that what you want can be done within memory limitations. The solution will have to be in Mata, I think, although the results might be manageable as a Stata data set. I would encourage you to move your thread to the Mata part of the forum. What you want is related to the agreement problem that John Mullahy mentioned above, as discussed in the Mata forum, but it's not quite identical.
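                    A quick way to get those two counts, as a sketch assuming the variables are named class and student as in #1 (first_c and first_s are placeholder names). A dense class-by-class matrix of doubles then needs roughly 8 bytes per cell, that is, 8 x (number of classes)^2 bytes.

                    ```stata
                    * Sketch only: assumes variables class and student in memory.
                    egen byte first_c = tag(class)      // 1 on one observation per distinct class
                    egen byte first_s = tag(student)    // 1 on one observation per distinct student
                    count if first_c                    // number of distinct classes
                    count if first_s                    // number of distinct students
                    ```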



                    • #11
                      Mikhail Balaev, check whether you can apply my looping solution in #2 of the following link. It appears that your question is similar to that posed in that thread.

                      https://www.statalist.org/forums/for...th-two-columns



                      • #12
                        I had forgotten about that other thread. Maybe there is a good approach without Mata. If the number of classes is not too big, my approach using -joinby- from that other thread might work:
                        Code:
                        input class student
                        1 1
                        1 2
                        1 3
                        1 4
                        1 5
                        2 2
                        2 3
                        2 5
                        3 1
                        3 5
                        4 2
                        4 4
                        4 5
                        end
                        preserve
                        rename class otherclass
                        tempfile temp
                        save `temp'
                        restore
                        joinby student using `temp'
                        keep if class < otherclass // duplicates
                        drop student  // no longer needed so save space
                        bysort class otherclass: gen with_other  = _N  // number of students this pair shares
                        by class otherclass: keep if _n == 1
                        list
                        reshape wide with_other, i(class) j(otherclass)



                        • #13
                          Andrew Musau - I have 116,000 student-class observations, so when I ran your code, adjusted to my data, I stopped the computation after some 15 minutes. I think it would work well for data like the example in that thread, but unfortunately not in my case.

                          Mike Lacy - Unfortunately with 696 classes this table would not work.
                          But I saw your code in the other thread that Andrew mentioned, and it did work!! Though I must say that I don't understand a good part of it and need to learn that. I am also not sure how to name the columns the same as the rows. In the rows I have my exam codes, but in the columns I have "attend1 ... attend695". Here is the code edited for my variables:

                          preserve
                          rename exam_code exam_code2
                          tempfile temp
                          save `temp'
                          restore
                          joinby student_id using `temp'
                          gen byte attend = 1
                          // Count co-attendances
                          collapse (sum) attend , by(exam_code exam_code2)
                          replace attend = 0 if (exam_code == exam_code2) // self
                          // Adjacency format
                          reshape wide attend , i(exam_code) j(exam_code2)
                          recode attend* (.=0)
                          // Export
                          export excel using "YourExcel.xlsx", firstrow(variables) replace



                          • #14
                            Originally posted by Mikhail Balaev
                            nw2fromedge would be what I need, however Stata crashes with the system error that it runs out of memory while trying to generate a square matrix with my dataset of 116,000 student-class observations.
                            How many unique classes do you have? If that command relies on Mata, there shouldn't be a limitation on the matrix size other than your machine's memory. (I don't know whether Mata will allow memory caching for matrices that are too large to fit into available RAM.)

                            I don't know anything about -nw2fromedge- or what it creates and returns in addition to the matrix you're looking for that might be the problem. But you can try something like what's attached, which creates just the matrix you're looking for as the largest data structure. Its syntax is illustrated in the attached do-file, but it's called from the Stata command line via a static member function with
                            Code:
                            mata:Main::beginHere(<student variable name>, <class variable name>, <name of square matrix returned to Stata>)
                            If the number of classes exceeds 11 000 (i.e., won't fit into a Stata matrix), then you can write a little Mata ditty that calls it and then diverts the matrix to a file or whatnot.
                            Attached Files



                            • #15
                              Joseph Coveney Hi Joseph, I ran your script, but hit a break after an hour of computing. I have 696 classes and 116,000 student-class observations. The way Stata goes about computing the class-by-class matrix seems to be just too inefficient. Anyways, the above solution worked.
