Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Speed up model

    Hi,

    I am trying to find the best combination of number of groups and polynomials in a model. I am using the Stata plug-in traj. It permits group-based trajectory modeling. My problem is that it takes such a long time to iterate all the possible combinations. Any suggestions to speed up my code a bit?

    Dataset example.
    sofa_card_orig_* is a variable taking integer values between 0-4.
    time_* is just a dummy indicating time between the above measurements. It goes from 1-5

    My real dataset has 722 observations, all observed under 28 days, sofa_card_orig_1 - sofa_card_orig_28 and time_1-time_28.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(sofa_card_orig_1 sofa_card_orig_2 sofa_card_orig_3 sofa_card_orig_4 sofa_card_orig_5 time_1 time_2 time_3 time_4 time_5)
    0 0 0 0 0 1 2 3 4 5
    3 3 3 3 1 1 2 3 4 5
    3 1 . . . 1 2 3 4 5
    3 . . . . 1 2 3 4 5
    0 0 0 0 0 1 2 3 4 5
    end

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    set trace off
    set more off
    set matsize 11000
    
    
        * set 
    local groups 6
    local polynom 2
    local first "sofa_card_orig_1"
    local connect "-"
    local second "sofa_card_orig_28"
    
    
    
    
    local maxnum: display `polynom'^1+`polynom'^2+`polynom'^3+`polynom'^4+`polynom'^5+`polynom'^6
    matrix a = J(`maxnum',6,.)
    matrix colnames a = "Groups" "Polynomials" "BIC(N)" "BIC(panels)" "e(AIC)" "e(ll)"
    local bicn: display -10^99
    local bicp: display -10^99
    local e(BIC_N_data): display -10^99
    local e(BIC_N_subjects): display -10^99
    
    
    
    local i 1
    forvalues a = 1/`polynom' {
    quietly traj, var(`first'`connect'`second') indep(time_1-time_28) model(cnorm) min(0) max(4) order(`a')
    matrix a[`i',1]= e(numGroups1)
    matrix a[`i',2]= `a'
    matrix a[`i',3]= e(BIC_N_data)
    matrix a[`i',4]= e(BIC_n_subjects)
    matrix a[`i',5]= e(AIC)
    matrix a[`i',6]= e(ll)
    if `e(BIC_N_data)' > `bicn' {
    local bicn `e(BIC_N_data)'
    local solutionn `a'
    }
    if `e(BIC_n_subjects)' > `bicp' {
    local bicp `e(BIC_n_subjects)'
    local solutionp `a'
    }
    
    
    local i `++i'
    forvalues b = 1/`polynom' {
    quietly traj, var(`first'`connect'`second') indep(time_1-time_28) model(cnorm) min(0) max(4) order(`a' `b')
    matrix a[`i',1]= e(numGroups1)
    matrix a[`i',2]= `a'`b'
    matrix a[`i',3]= e(BIC_N_data)
    matrix a[`i',4]= e(BIC_n_subjects)
    matrix a[`i',5]= e(AIC)
    matrix a[`i',6]= e(ll)
    if `e(BIC_N_data)' > `bicn' {
    local bicn `e(BIC_N_data)'
    local solutionn `a'`b'
    }
    if `e(BIC_n_subjects)' > `bicp' {
    local bicp `e(BIC_n_subjects)'
    local solutionp `a'`b'
    }
    
    
    local i `++i'
    forvalues c = 1/`polynom' {
    quietly traj, var(`first'`connect'`second') indep(time_1-time_28) model(cnorm) min(0) max(4) order(`a' `b' `c')
    matrix a[`i',1]= e(numGroups1)
    matrix a[`i',2]= `a'`b'`c'
    matrix a[`i',3]= e(BIC_N_data)
    matrix a[`i',4]= e(BIC_n_subjects)
    matrix a[`i',5]= e(AIC)
    matrix a[`i',6]= e(ll)
    if `e(BIC_N_data)' > `bicn' {
    local bicn `e(BIC_N_data)'
    local solutionn `a'`b'`c'
    }
    if `e(BIC_n_subjects)' > `bicp' {
    local bicp `e(BIC_n_subjects)'
    local solutionp `a'`b'`c'
    }
    
    
    local i `++i'
    forvalues d = 1/`polynom' {
    quietly traj, var(`first'`connect'`second') indep(time_1-time_28) model(cnorm) min(0) max(4) order(`a' `b' `c' `d')
    matrix a[`i',1]= e(numGroups1)
    matrix a[`i',2]= `a'`b'`c'`d'
    matrix a[`i',3]= e(BIC_N_data)
    matrix a[`i',4]= e(BIC_n_subjects)
    matrix a[`i',5]= e(AIC)
    matrix a[`i',6]= e(ll)
    if `e(BIC_N_data)' > `bicn' {
    local bicn `e(BIC_N_data)'
    local solutionn `a'`b'`c'`d'
    }
    if `e(BIC_n_subjects)' > `bicp' {
    local bicp `e(BIC_n_subjects)'
    local solutionp `a'`b'`c'`d'
    }
    
    local i `++i'
    forvalues e = 1/`polynom' {
    quietly traj, var(`first'`connect'`second') indep(time_1-time_28) model(cnorm) min(0) max(4) order(`a' `b' `c' `d' `e')
    matrix a[`i',1]= e(numGroups1)
    matrix a[`i',2]= `a'`b'`c'`d'`e'
    matrix a[`i',3]= e(BIC_N_data)
    matrix a[`i',4]= e(BIC_n_subjects)
    matrix a[`i',5]= e(AIC)
    matrix a[`i',6]= e(ll)
    if `e(BIC_N_data)' > `bicn' {
    local bicn `e(BIC_N_data)'
    local solutionn `a'`b'`c'`d'`e'
    }
    if `e(BIC_n_subjects)' > `bicp' {
    local bicp `e(BIC_n_subjects)'
    local solutionp `a'`b'`c'`d'`e'
    }
    
    
    local i `++i'
    forvalues f = 1/`polynom' {
    quietly traj, var(`first'`connect'`second') indep(time_1-time_28) model(cnorm) min(0) max(4) order(`a' `b' `c' `d' `e' `f')
    matrix a[`i',1]= e(numGroups1)
    matrix a[`i',2]= `a'`b'`c'`d'`e'`f'
    matrix a[`i',3]= e(BIC_N_data)
    matrix a[`i',4]= e(BIC_n_subjects)
    matrix a[`i',5]= e(AIC)
    matrix a[`i',6]= e(ll)
    if `e(BIC_N_data)' > `bicn' {
    local bicn `e(BIC_N_data)'
    local solutionn `a'`b'`c'`d'`e'`f'
    }
    if `e(BIC_n_subjects)' > `bicp' {
    local bicp `e(BIC_n_subjects)'
    local solutionp `a'`b'`c'`d'`e'`f'
    }
    
    
    
    
    
    
    local i `++i'
    }
    }
    }
    }
    }
    }
    
    
    
    matrix list a
    display "Best solution BIC(n): " `solutionn'
    display "Best solution BIC(p): " `solutionp'
    
    
    * set 
    putexcel set `first'`connect'`second'gr`groups'poly`polynom', replace
    putexcel A1 = matrix(a), names
    I have tried to remove all locals (as many as i can). I also tried to skip the parts where I get the best solution according to BIC. Both with minimal, if none, improvement in time.

    All suggestions are welcome!

    /Jesper



  • #2
    It's almost certainly true that the overwhelming bulk of the computational time is due to the -traj- command itself, and not to anything else in your program. The fact that -traj- is a plug-in (i.e., uses C code) as opposed to a simple ado code indicates that it is very compute intensive. On this ground, I'd suspect that there's not much you can do with your code to speed things up. One thought would be to see if the community contributed command -parallel- can help (-ssc describe parallel-).

    So, how would you verify that the computation time is mostly due to the -traj- command? Take part of one of your loops, and time the whole loop and just the -traj-command. In outline form, it might look like this:

    Code:
    timer clear 1
    timer clear 2
    timer on 1
    forvalues .... {
       timer on 2
       quietly traj ....
       timer off 2
     ...  other stuff
    }
    timer off 1
    di "Total time."
    timer list 1
    di "Just the traj command"
    timer list 2
    You might even be able to get a useful timing with just the the built-in -profiler- command.

    A couple of suggestions for the future: If you want other people to be able to read your code, I'd recommend you use conventional indenting. Also, it's a norm on StataList to indicate where an add-on program came from, which is particularly important for -traj-, since the usual -findit traj- does not reveal it.

    Comment


    • #3
      I am not familiar with the -traj- program, but the mere fact that you have cranked the matrix size up to 11000 suggests to me that it is doing some pretty intense matrix algebra. I would bet that if you profiled the code, you would find that you are spending 99% or more of the time inside -traj-. The various -if- commands and storing things in matrices are extremely fast and no matter how much you streamline them you will not get a noticeable performance boost.

      I think the best you can hope to do is to study the -traj- program's help file (assuming it has one) and see if you can boost performance by specifying some options or some different organization of your data. Or, you could try to "hack" -traj- itself in some ways to save time--that may or may not accomplish anything depending on how tightly it was coded in the first place.

      tl;dr There probably isn't much, if anything, you can do to speed this up.

      Added: Crossed with #2, which provides more or less the same advice with a less pessimistic tone.

      Comment


      • #4
        Thanks both Mike and Clyde. You are both (as always) correct. I timed the different parts of the code. The traj command does take 99% of the time for the loops. Regarding the add-on. Thanks for the etiquette advice, I remember the tips for my next question! I got it from https://www.andrew.cmu.edu/user/bjones/strtxmpl2.htm It is well used in several scientific areas, and it allows multi-trajectory modeling. It performs a form of finite mixture modeling. Really like it, but i need a faster computer...

        Comment


        • #5
          I forgot. The code is (slightliy) modified from https://www.researchgate.net/post/Ho...ing_with_stata , mr Jan Helmdag. Credit where credit is due.

          Comment


          • #6
            Jesper Eriksson and Mike Lacy Hi, please how do you proceed to select the best fit each group in a multi-group based trajectory modeling ? I am looking to straitforward solution in order to identify the best fit for each group simultaneously.
            This is an example of command that I am using:

            traj , multgroups(6) var1(beverage1*) indep1(age_*) model1(beta) order1(3 3 3 3 3 3) ///
            var2(beverage2*) indep2(age_*) model2(zip) order2(3 3 3 3 3 2) iorder2(-1) ///
            var3(beverage3*) indep3(age_*) model3(zip) order3(3 3 3 3 3 1) iorder3(-1) ///
            var4(beverage4*) indep4(age_*) model4(logit) order4(3 3 3 3 3 0)


            Regards.

            Comment


            • #7
              Originally posted by Radhouene DOGGUI View Post
              Jesper Eriksson and Mike Lacy Hi, please how do you proceed to select the best fit each group in a multi-group based trajectory modeling ? I am looking to straitforward solution in order to identify the best fit for each group simultaneously.
              This is an example of command that I am using:

              traj , multgroups(6) var1(beverage1*) indep1(age_*) model1(beta) order1(3 3 3 3 3 3) ///
              var2(beverage2*) indep2(age_*) model2(zip) order2(3 3 3 3 3 2) iorder2(-1) ///
              var3(beverage3*) indep3(age_*) model3(zip) order3(3 3 3 3 3 1) iorder3(-1) ///
              var4(beverage4*) indep4(age_*) model4(logit) order4(3 3 3 3 3 0)


              Regards.
              Hi Radhouene,

              Finding the best model fit is at best a cumbersome procedure and sometimes painfully time-demanding. Especially if you have an advanced model (i.e, many waves/timeperiods, high polynomial order, many trajectory groups etc).

              The usual approach (with pros and cons) is to first find the number of groups an then start model fit. Most researchers uses a simple model with all polynomial orders set to "1". In your example it would be

              For one group:
              Code:
              traj , multgroups(1) var1(beverage1*) indep1(age_*) model1(beta) order1(1) var2(beverage2*) indep2(age_*) model2(zip) order2(1) iorder2(-1) var3(beverage3*) indep3(age_*) model3(zip) order3(1) iorder3(-1) var4(beverage4*) indep4(age_*) model4(logit) order4(1)
              And then try the with two trajectory groups:
              Code:
              traj , multgroups(2) var1(beverage1*) indep1(age_*) model1(beta) order1(1 1) var2(beverage2*) indep2(age_*) model2(zip) order2(1 1) iorder2(-1) var3(beverage3*) indep3(age_*) model3(zip) order3(1 1) iorder3(-1) var4(beverage4*) indep4(age_*) model4(logit) order4(1 1)
              And up to the number of groups you want to examine. Then compare BIC, AIC for example and choose the number of groups you want to proceed with. After you choosen the number of groups, fit the model by iteratively add or withdraw polynomial orders and compare BIC values for example. Then there are other ways of choosing the optimal model and number of groups but this is a common way. Other measures are entropy, Odds of correct classification, not too small groups, Average probability of class membership >0.7 to name some.

              Recommended reading:
              Group-based multi-trajectory modeling, Nagin, D. S. Jones, B. L. Passos, V. L. Tremblay, R. E., 2018, Stat Methods Med Res

              That is a start at least. But you should also read up on the method and Nagins seminal work from 2005. Good luck!

              Comment

              Working...
              X