  • Sequence analysis clusters with big data - errors

    Hello,
    I'm very new to sequence analysis, and especially to cluster analysis of sequences, and it's difficult to find detailed tutorials online for big datasets. I want to create clusters from sequences made of the states 1, 2, and 3 (e.g. 2222222... or 22223333322...), with one state for each of the 18 years in the data, so each sequence has length 18. From previous research and theory I know I want 7 clusters. I need to assign each individual to a cluster for a regression analysis later.

    I have a big dataset with around 170,000 individuals, and about 20,000 individual sequences. (I can't share the data because of privacy regulations.)

    Right now I'm using the following code:
    Code:
    sqom
    matrix dir
    sqclusterdat
    clustermat wardslinkage SQdist, name(myname) add
    cluster generate cluster = groups(7)
    sqclusterdat, return keep(cluster myname*)

    After running clustermat I get this error:
    "unable to allocate real.... function returned error... r(2900);"

    Is it because of the size of the dataset / number of sequences? What would you recommend to deal with this? I tried other options of sqom but it didn't help.



    I also tried to run this analysis in R, but encountered errors at a similar step (also most likely due to the size).
    Last edited by Agata Troost; 01 Apr 2020, 07:34. Reason: adding tags

  • #2
    You will increase your chances of a useful answer by following the FAQ on asking questions: provide Stata code in code delimiters, readable Stata output, and sample data using dataex.

    When you can't share the actual data, and often even when you can, it is better to cook up some artificial data that illustrates the problem clearly in a reasonable number of observations. This takes more work on your part, of course. It is also quite possible to mask individual-level data by, for example, replacing one number with a different number.

    With a user-written program, getting help on this list often depends on whether someone here actively uses that program. It is trivial for you to figure out whether the problem is the size of the data set: try it on a small portion of the data. Indeed, if the data set is as large as it sounds, you would be well advised to do almost all of your programming and debugging on a portion of the data rather than wait for each run to work through an enormous data set. I don't know how this program works, but if it sets up a matrix to hold the pairwise distances, then it is quite likely that you are exceeding the allowable matrix size or the available memory. For this kind of question, you can either open the ado-file and figure out what is going on or contact the program's authors.
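
    To make the size argument concrete, here is a rough sketch in Stata. It rests on assumptions I have not checked against the ado-file: that the routine stores a full square matrix of pairwise distances as 8-byte doubles, that the data are in long format with one row per person-year, and that the person identifier is called id (a hypothetical name; substitute your own). The 1% cutoff and the seed are arbitrary.

```stata
* Back-of-envelope storage for a full n x n matrix of 8-byte doubles, in GB.
display 20000^2 * 8 / 1e9      // about 20,000 sequences -> 3.2
display 170000^2 * 8 / 1e9     // about 170,000 individuals -> 231.2

* Test the same commands on a random subset of persons.
* Whole persons are kept or dropped so that no sequence is cut short.
set seed 12345
preserve
egen byte tag = tag(id)                 // flags one row per person
gen double u = runiform() if tag        // one random draw per person
bysort id (u): replace u = u[1]         // copy that draw to all rows of the person
keep if u < 0.01                        // keeps roughly 1% of persons
* ... rerun sqom and the clustering commands from the first post here ...
restore
```

    If the commands run cleanly on the subset and fail only on the full data, size is the likely culprit rather than a mistake in the code.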
