Creating a square matrix using two relational variables

Mikhail Balaev

Join Date: Aug 2015

Posts: 13
#1

15 Jan 2020, 21:49

Hi,

I've spent 5-6 hours browsing the help files and past posts, but could not find the answer to what seems to be a very simple procedure. I have used UCINet and Pajek before and want to switch to Stata for working with matrices.

I have two relational variables: students and classes that they are enrolled in. I want to create a square class-by-class matrix that would show the number of students in common in each class pair. This is very similar to the classic Southern Women study, albeit with a much larger data set. The initial two-variable dataset looks like this:

class student
1 1
1 2
1 3
1 4
1 5
2 2
2 3
2 5
3 1
3 5
4 2
4 4
4 5

I want to end up with (in the above example) a 4x4 class matrix (with zeros on the diagonal). I would appreciate any suggestion for Stata, Mata, or nwcommands.

Cheers,

Mikhail

Last edited by Mikhail Balaev; 15 Jan 2020, 22:21.
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#2

15 Jan 2020, 22:42

Well, I don't know how well this solution will scale to a very large data set given that it uses -cross- and -reshape-. But it gives the information required:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(class student)
1 1
1 2
1 3
1 4
1 5
2 2
2 3
2 5
3 1
3 5
4 2
4 4
4 5
end

fillin class student
replace _fillin = !_fillin
rename _fillin has_student
reshape wide has_student, i(class) j(student)
tempfile copy
save `copy'
rename _all =A
cross using `copy'
ds *A, not
rename (`r(varlist)') =B
gen long obs_no = _n
reshape long has_student@A has_student@B, i(obs_no) j(student)
gen byte common = has_studentA & has_studentB
collapse (sum) common, by(classA classB)

If you really want this as a 4x4 matrix, you can get that with -reshape wide- and mkmat. But unless you will actually be doing matrix algebra with these results, you are probably better off leaving it as is--this is the layout of the data that will be easiest to work with for most purposes.
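The -reshape wide- plus -mkmat- step mentioned above could look roughly like the sketch below. It is only an illustration: the long layout (classA classB common) is typed in by hand for the 4-class example rather than produced by the code above, and the matrix name C is an arbitrary choice.

```stata
* Assumed starting point: the long layout (classA classB common),
* entered by hand for the 4-class example.
clear
input byte(classA classB common)
1 1 5
1 2 3
1 3 2
1 4 3
2 1 3
2 2 3
2 3 1
2 4 2
3 1 2
3 2 1
3 3 2
3 4 1
4 1 3
4 2 2
4 3 1
4 4 3
end

replace common = 0 if classA == classB      // zero out self-ties
reshape wide common, i(classA) j(classB)    // one row per class, one column per class
mkmat common*, matrix(C)                    // C is an arbitrary matrix name
matrix list C
```

The diagonal is zeroed before reshaping because it is easier to address pairs by name in the long layout than by position in the matrix.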

In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Mikhail Balaev

Join Date: Aug 2015

Posts: 13
#3

15 Jan 2020, 23:27

Thank you, Clyde, but my dataset has 116,000 student-class observations, so reshaping it wide will exceed my Stata SE number of variables. I also do need a square matrix with a zero diagonal for further network analysis, centrality measures, and graphs.
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#4

15 Jan 2020, 23:52

I take it that you've already googled something like stata social network analysis and there's nothing out there that's been written to do this.
John Mullahy

Join Date: Dec 2016

Posts: 751
#5

16 Jan 2020, 07:26

I wonder if the discussion in this thread may be useful: https://www.statalist.org/forums/for...ise-agreements
Mikhail Balaev

Join Date: Aug 2015

Posts: 13
#6

16 Jan 2020, 07:54

Hi Joseph, sure, and surprisingly I did not find how to convert two variables into a relational network matrix. The only solution I found was
tab exam1 student_id, matcell(ex_st)
matrix exam=ex_st*ex_st'

However, this matrix contains values instead of zeros on the diagonal (UCINet and Pajek have options for setting these to zero to eliminate nodes' ties to themselves). The main issue is that, due to the size of the dataset, Stata cannot execute the "tab" command.
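For the diagonal specifically, one possible fix is to subtract the diagonal from the product. The sketch below runs on the small class/student example from #1, not on the real data; the matrix names A and C are arbitrary, and it does nothing about the size limit of -tabulate-.

```stata
clear
input class student
1 1
1 2
1 3
1 4
1 5
2 2
2 3
2 5
3 1
3 5
4 2
4 4
4 5
end

tabulate class student, matcell(A)     // A: 4 x 5 class-by-student incidence matrix
matrix C = A * A'                      // shared students; diagonal holds class sizes
matrix C = C - diag(vecdiag(C))        // subtract the diagonal to zero out self-ties
matrix list C
```

Here vecdiag() extracts the diagonal as a row vector and diag() turns it back into a diagonal matrix, so the subtraction leaves the off-diagonal counts untouched.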
Mikhail Balaev

Join Date: Aug 2015

Posts: 13
#7

16 Jan 2020, 07:56

Hi John, thank you for the link. They already begin with a matrix A(i,j). I want to get that matrix. If I could get it, then I would just multiply it by its transpose and get what I need. So, I'm still trying to figure out how to get a matrix A(i,j) from two relational variables.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#8

16 Jan 2020, 12:15

I'm an admirer but not exactly a user of -nwcommands- (-search nwcommands-). Its -nw2fromedge- command, while making a network, will create a Mata matrix in which element i,j contains the number of students that class i and class j have in common. Is this correct for you?

Code:

clear
nwclear
mata mata clear
input class student
1 1
1 2
1 3
1 4
1 5
2 2
2 3
2 5
3 1
3 5
4 2
4 4
4 5
end
list
//
// The following command comes from -nwcommands-. Among other things, it creates
// a Mata matrix named nw_mata1, which I believe has what you need.
nw2fromedge class student, project(1)
mata: nw_mata1

I think, however, that -nwcommands- calculates its centrality measures using its own network names. If you want one of its built-in network centrality measures, I'm not sure that you will need to manipulate the nw_mata1 matrix yourself. You might need some different options on the nw2fromedge command in order to get the network that you would like nwcommands to work on.
Mikhail Balaev

Join Date: Aug 2015

Posts: 13
#9

17 Jan 2020, 11:26

Thanks Mike. nw2fromedge would be what I need, however Stata crashes with the system error that it runs out of memory while trying to generate a square matrix with my dataset of 116,000 student-class observations. Memory, number of variables, and matrix sizes are all set to maximum, Stata SE.

Last edited by Mikhail Balaev; 17 Jan 2020, 11:58.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#10

17 Jan 2020, 11:40

In terms of the possibility of other do-it-yourself approaches, it would be relevant to know how many students and how many classes there are. Depending on those facts, I think it's possible that what you want might be solved within memory limitations. The solution will have to be in Mata, I think, although the results might be manageable as a Stata data set. I would encourage you to move your thread to the Mata part of the forum. What you want is related to the agreement problem that John Mullahy mentioned above as discussed in the Mata forum, but it's not quite identical.
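As an illustration of what a Mata approach might look like, here is a minimal sketch on the small example from #1. It is not code from this thread: the helper variable cid and the matrix name C are made up for the example, and it assumes each student-class pair appears only once. It never builds the class-by-student matrix; it walks through the data one student at a time and increments the count for every pair of classes that student takes.

```stata
clear
input class student
1 1
1 2
1 3
1 4
1 5
2 2
2 3
2 5
3 1
3 5
4 2
4 4
4 5
end

duplicates drop class student, force   // assumption: one row per enrolment
egen long cid = group(class)           // hypothetical helper: class index 1..k
sort student cid                       // panelsetup() needs data sorted by student

mata:
X = st_data(., ("cid", "student"))
k = max(X[., 1])
C = J(k, k, 0)
info = panelsetup(X, 2)                // first and last row of each student
for (p = 1; p <= rows(info); p++) {
    S = panelsubmatrix(X, p, info)
    idx = S[., 1]                      // classes taken by this student
    C[idx, idx] = C[idx, idx] :+ 1     // each pair of those classes gains one
}
_diag(C, 0)                            // zero out self-ties
st_matrix("C", C)                      // copy the result to a Stata matrix
end
matrix list C
```

The only large object is the k x k result, so memory use depends on the number of classes rather than on the number of students. Whether this is fast enough on a particular data set would need to be checked.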
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#11

17 Jan 2020, 12:18

Mikhail Balaev, check whether you can apply my looping solution in #2 of the following link. It appears that your question is similar to that posed in that thread.

https://www.statalist.org/forums/for...th-two-columns

Mike Lacy

Join Date: Apr 2014
Posts: 2416

#12

17 Jan 2020, 15:11

I had forgotten about that other thread. Maybe there is a good approach without Mata. If the number of classes is not too big, my approach using -joinby- from that other thread might work:

Code:

input class student
1 1
1 2
1 3
1 4
1 5
2 2
2 3
2 5
3 1
3 5
4 2
4 4
4 5
end
preserve
rename class otherclass
tempfile temp
save `temp'
restore
joinby student using `temp'
keep if class < otherclass // keep each pair once: drops self-pairs and mirrored duplicates
drop student  // no longer needed so save space
bysort class otherclass: gen with_other  = _N  // number of students this pair shares
by class otherclass: keep if _n == 1
list
reshape wide with_other, i(class) j(otherclass)

Mikhail Balaev

Join Date: Aug 2015

Posts: 13
#13

17 Jan 2020, 19:07

Andrew Musau - I have 116,000 student-class observations, so when I ran your code, adjusted to my data, I stopped the computation after some 15 minutes. I think it will work great, as in the example in the thread, but not in my case, unfortunately.

Mike Lacy - Unfortunately with 696 classes this table would not work.
But I saw your code in the other thread that Andrew mentioned and it did work!! Though I must say that I don't understand a good part of it, so I need to learn that. I am also not sure how to name the columns the same as the rows. In the rows I have my exam codes, but in the columns I have "attend1" ... "attend695". Here is the code edited for my variables:

preserve
rename exam_code exam_code2
tempfile temp
save `temp'
restore
joinby student_id using `temp'
gen byte attend = 1
// Count co-attendances
collapse (sum) attend , by(exam_code exam_code2)
replace attend = 0 if (exam_code == exam_code2) // self
// Adjacency format
reshape wide attend , i(exam_code) j(exam_code2)
recode attend* (.=0)
// Export
export excel using "YourExcel.xlsx", firstrow(variables) replace
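On the question of giving the columns the same names as the rows, one possibility is to move the wide data into a Stata matrix and label both dimensions from the codes. The sketch below reruns the same pipeline on the small class/student example from #1 and assumes numeric codes. The "c" prefix and the matrix name C are arbitrary choices for the illustration.

```stata
clear
input class student
1 1
1 2
1 3
1 4
1 5
2 2
2 3
2 5
3 1
3 5
4 2
4 4
4 5
end

preserve
rename class class2
tempfile temp
save `temp'
restore
joinby student using `temp'
gen byte attend = 1
collapse (sum) attend, by(class class2)    // count co-attendances
replace attend = 0 if class == class2      // self
reshape wide attend, i(class) j(class2)
recode attend* (.=0)

* Build one label per class code, e.g. c1 c2 c3 c4
levelsof class, local(codes)
local names
foreach c of local codes {
    local names `names' c`c'
}
mkmat attend*, matrix(C)
matrix rownames C = `names'
matrix colnames C = `names'
matrix list C
```

This relies on the rows and the attend* columns both being in ascending code order, which holds here because self-pairs are kept until after the collapse, so every code appears in both dimensions.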
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#14

17 Jan 2020, 19:44

Originally posted by Mikhail Balaev

nw2fromedge would be what I need, however Stata crashes with the system error that it runs out of memory while trying to generate a square matrix with my dataset of 116,000 student-class observations.

How many unique classes do you have? If that command relies on Mata, there shouldn't be a limitation on the matrix size other than your machine's memory. (I don't know whether Mata will allow memory caching for matrixes that are too large to fit into available RAM.)

I don't know anything about -nw2fromedge- or what it creates and returns in addition to the matrix you're looking for that might be the problem. But you can try something like what's attached, which creates just the matrix you're looking for as the largest data structure. Its syntax is illustrated in the attached do-file, but it's called from the Stata command line via a static member function with

Code:

mata:Main::beginHere(<student variable name>, <class variable name>, <name of square matrix returned to Stata>)

If the number of classes exceeds 11 000 (i.e., won't fit into a Stata matrix), then you can write a little Mata ditty that calls it and then diverts the matrix to a file or whatnot.
Attached Files

NW.do (2.9 KB, 1 view)
Mikhail Balaev

Join Date: Aug 2015

Posts: 13
#15

19 Jan 2020, 14:08

Joseph Coveney Hi Joseph, I ran your script, but hit a break after an hour of computing. I have 696 classes and 116,000 student-class observations. The way Stata goes about computing the class-by-class matrix seems to be just too inefficient. Anyways, the above solution worked.