Sequence analysis clusters with big data - errors

Agata Troost

Join Date: Oct 2019

Posts: 4
#1

01 Apr 2020, 07:32

Hello,
I'm very new to sequence analysis, and especially to clustering sequences, and it's difficult to find detailed tutorials online for big datasets. I want to create clusters from sequences made up of the states 1, 2, and 3 (e.g. 2222222... or 22223333322...), with one state for each of the 18 years in the data (so each sequence has length 18). From previous research and theory I know I want 7 clusters. I need to assign each individual to a cluster for a regression analysis later.

I have a big dataset with around 170,000 individuals, and about 20,000 individual sequences. (I can't share the data because of privacy regulations.)

Right now I'm using the following code:

Code:

sqom
matrix dir
sqclusterdat
clustermat wardslinkage SQdist, name(myname) add
cluster generate cluster = groups(7)
sqclusterdat, return keep(cluster myname*)

After running clustermat I get this error:
"unable to allocate real.... function returned error... r(2900);"

Is it because of the size of the dataset / number of sequences? What would you recommend to deal with this? I tried other options of sqom but it didn't help.

I also tried to run this analysis in R, but encountered errors at a similar step (also most likely due to the size).

Last edited by Agata Troost; 01 Apr 2020, 07:34. Reason: adding tags
Tags: big data, clusters, sequence analysis
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

02 Apr 2020, 11:06

You will increase your chances of a useful answer by following the FAQ on asking questions – provide Stata code in code delimiters, readable Stata output, and sample data using dataex.

When you can't share the actual data, and often when you can, it can be better to cook up some artificial data that illustrates the problem clearly in a reasonable number of observations. This takes more work on your part of course. It is also quite possible to mask data from individuals by, for example, replacing one number with a different number.
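As a minimal sketch of what such artificial data could look like for this question: the block below builds 200 made-up individuals with one state (1, 2, or 3) for each of 18 years, in long format. The variable names (id, year, st), the seed, and the 0.8 persistence probability are all assumptions chosen for illustration, not taken from the original data.

```stata
* Artificial sequence data: 200 individuals x 18 years, states 1/2/3
* (all names and parameters here are hypothetical)
clear
set seed 12345
set obs 200
generate long id = _n

* one row per individual-year
expand 18
bysort id: generate byte year = _n

* first year: a random state; later years: keep the previous state with
* probability 0.8, otherwise draw a new random state
bysort id (year): generate byte st = ceil(3 * runiform()) if _n == 1
bysort id (year): replace st = cond(runiform() < 0.8, st[_n-1], ceil(3 * runiform())) if _n > 1

* With the user-written SQ package installed (ssc install sq), the data
* could then be declared as sequence data:
* sqset st id year
```

A small file like this can be posted with dataex and is usually enough for someone else to reproduce a problem.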

With a user-written program, getting help on this list often depends on whether someone actively uses that program. It is trivial for you to figure out whether the problem is the size of the data set – try it on a small portion of the data set. Indeed, if the data set is as large as it sounds, you will often be well advised to do almost all of your programming and debugging on a portion of the data rather than wait for it to run through an enormous data set. I don't know how this program works, but if it is setting up a matrix to handle this, then it is quite likely that you are exceeding the allowable matrix size. For this kind of question, you can either open the ado file and try to figure out what's going on, or contact the program's authors.
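A rough sketch of both checks, under stated assumptions: the first two lines compute how much memory a full square matrix of 8-byte doubles would need (whether the program actually stores such a matrix is not something I have verified), and the rest draws a random 2% of individuals from toy long-format data. The variable names (id, year, st) and the toy data are hypothetical stand-ins for the real file.

```stata
* Back-of-envelope memory for a full n x n matrix of 8-byte doubles
display %9.1f (20000^2 * 8 / 1e9) " GB for 20,000 sequences"
display %9.1f (170000^2 * 8 / 1e9) " GB for 170,000 individuals"

* Toy long-format data standing in for the real file (hypothetical names)
clear
set seed 2020
set obs 5000
generate long id = _n
expand 18
bysort id: generate byte year = _n
generate byte st = ceil(3 * runiform())

* Keep a random 2% of individuals, with all 18 rows of each sampled person:
* draw one random number per individual and copy it to all of their rows
bysort id (year): generate double u = runiform() if _n == 1
bysort id (year): replace u = u[1]
keep if u < 0.02
drop u

* ... then rerun the commands from post #1 on this subset
```

If the commands run cleanly on the subset and fail only on the full data, size is the likely culprit. Under the assumption above, the arithmetic comes to roughly 3.2 GB for 20,000 sequences and roughly 231 GB for 170,000 individuals.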