Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to include different numbers of variables in a loop function of regression

    I am running logistic regressions storing each of the estimates in order to run multiple likelihood-ration test and generate a matrix for p-value of each likelihood test.
    Here is a part of my data:
    ID pheno cov1 cov2 a_15 a_11 a_3 a_9_f a_9_y ... dpb1_229
    11 0 0.0071 -0.0285 1 0 0 0 2 ... 0
    22 1 0.0066 -0.0018 1 0 0 0 2 ... 0
    23 0 0 0.0085 0 0 0 0.997 0.003 ... 0
    27 1 0.0165 0.0084 1 1 0 0.999 0.001 ... 1
    35 0 0.0164 0.0123 0 0 0 0 0 ... 0
    There are 1,129 columns from a_15 to dpb1_229 indicating different positions.

    Using this data, I successfully could run following command for the purpose I mentioned above thanks to the advice I got in a different topic of this forum.

    Code:
    logit pheno cov1 cov2
    est sto E1
    
    local j=2
    foreach var of varlist a_15_-dpb1_229{
    logit pheno cov1 cov2 `var'
    est sto E`j'
    local ++j
    }
    
    matrix p = J(1130, 1, .)
    forvalues i = 2/1130 {
     lrtest E1 E`i', force
     matrix p[`i', 1] = r(p)
    }
    matrix list p
    However, I found that I had to include a_9_f and a_9_y in the same model (logit pheno cov1 cov2 a_9_f a_9_y) in stead of running separately because these two are data of the same position.
    They have the same position names, but have additional alphabet characters coding different amino acids.
    There are also several similar data in the 1,129 columns, which I also have to deal with in the same manner.
    So my question is if there is a way to modify the command above to accomplish this task.

    I can also provide additional information about my data if necessary.

    I guess this might be a complicated task, but I will really appreciate any suggestions or comments.

    Thanks in advance.

  • #2
    You can add exceptions using the if command. Because the loop considers one variable at a time, you will have two of the same regression, but the dimension of your final matrix does not change. Below, I assume that you have only 2 variables which include "a_9_" as part of their variable name. If not, use the -inlist()- function, specifying the full variable names.

    Code:
    logit pheno cov1 cov2
    est sto E1
    
    local j=2
    foreach var of varlist a_15_-dpb1_229{
            if regexm("`var'", "a_9_"){
                      logit pheno cov1 cov2 a_9_*
                      est sto E`j'
                      local ++j
             }
             else{
                    logit pheno cov1 cov2 `var'
                    est sto E`j'
                    local ++j
              }            
    }
    matrix p = J(1130, 1, .)
    forvalues i = 2/1130 {
             lrtest E1 E`i', force
             matrix p[`i', 1] = r(p)
    }
    matrix list p
    Last edited by Andrew Musau; 18 Mar 2020, 04:35.

    Comment


    • #3
      Here is another approach that may work within your existing code. As written, it depends on paired variables appearing next to each other in your dataset.
      Code:
      ds a_15-dpb1_229
      local ivs `r(varlist)'
      macro list _ivs
      local ivs : subinstr local ivs `"a_9_f a_9_y"' `"a_9_*"'
      macro list _ivs
      foreach v of local ivs {
          display _newline "===== summarizing `v' ====="
          summarize `v'
      }
      Code:
      . ds a_15-dpb1_229
      a_15      a_11      a_3       a_9_f     a_9_y     dpb1_229
      
      . local ivs `r(varlist)'
      
      . macro list _ivs
      _ivs:           a_15 a_11 a_3 a_9_f a_9_y dpb1_229
      
      . local ivs : subinstr local ivs `"a_9_f a_9_y"' `"a_9_*"'
      
      . macro list _ivs
      _ivs:           a_15 a_11 a_3 a_9_* dpb1_229
      
      . foreach v of local ivs {
        2.         display _newline "===== summarizing `v' ====="
        3.         summarize `v'
        4. }
      
      ===== summarizing a_15 =====
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
              a_15 |         10   -.1131776     1.05476  -1.836232   1.853389
      
      ===== summarizing a_11 =====
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
              a_11 |         10    .1351789    1.405627  -2.228199    2.12187
      
      ===== summarizing a_3 =====
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
               a_3 |         10    .6596058    .9512471  -1.042256    2.16461
      
      ===== summarizing a_9_* =====
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
             a_9_f |         10   -.5653816    1.222993  -2.268521   2.109213
             a_9_y |         10   -.4161206     .974415  -2.028768   1.061687
      
      ===== summarizing dpb1_229 =====
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
          dpb1_229 |         10    .0836618    1.102003  -1.651658   2.251146

      Comment


      • #4
        Hi Andrew,

        Thank you so much for your quick reply.
        As there are four variables that include "a_9_" as part of their variable name, I replaced "regexm" with "inlist" as you suggested.
        However, the loop function stopped with generating error messages as seen in the very last part of the following codes. What wrong with my codes?
        Code:
        . logit ssb75 cov1 cov2
        . est sto E1
        
        . local j=2
        . foreach var of varlist a_15-dpb1_229{
          2. if inlist("`var'","a_9_f","a_9_y","a_9_s","a_9_t"){
          3. logit pheno cov1 cov2 a_9_*
          4. est sto E`j'
          5. local ++j
          6. }
          7. else{
          8. logit pheno cov1 cov2 `var'
          9. est sto E`j'
         10. local ++j
         11. }
         12. }
        
        // the code above generated the following error messages:
        note: a_9_t != 0 predicts success perfectly
              a_9_t dropped and 4 obs not used
        
        note: a_9_f != 0 predicts failure perfectly
              a_9_f dropped and 1 obs not used
        
        note: a_9_y != 2 predicts failure perfectly
              a_9_y dropped and 5 obs not used
        
        outcome = cov1 <= .0066 predicts data perfectly
        r(2000);
        I also attached an updated 'test' file, which is only a part of my real data (467 subjects with 1,139 variables), so that you can have a look at it (I am sorry for a busy table):
        id pheno cov1 cov2 a_15 a_11 a_3 a_9_f a_9_y a_9_s a_9_t a_44 a_56 a_62_r a_62_q a_63 a_65 a_66 a_67 dpb1_229
        11 0 0.0071 -0.0285 1 0 0 0 2 0 0 0 0 1 1 0.997 0 2 0 0.015
        22 1 0.0066 -0.0018 1 0 0 0 2 0 0 0 0 1 1 0.997 0 2 0 0
        23 0 0 0.0085 0 0 0 0.997 0.003 0 0 1 1 0 0 0 1 0 0 0
        27 1 0.0165 0.0084 1 1 0 0.999 0.001 0 1 1 0 0 1 0 0 1 0 0.014
        35 0 0.0164 0.0123 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 0
        36 0 0.013 0.0134 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0
        39 0 -0.0021 -0.0101 1 0 0.013 0 1 0 0 0 1 0 1 0 1 1 0 0.002
        49 1 0.0081 0.0171 0 0 0.02 0.027 0.434 0 0.112 0 1 0.551 0.638 0.924 1 1 0 0.014
        55 1 0.0015 -0.0151 0 0 0 0.028 1.415 0 0.108 1 1 0 0 0 1 0 0 0
        56 0 -0.007 0.0155 1 0 0 0 1 0 0 0 1 0 1 0 1 1 0 0
        59 1 -0.0147 0.0139 0 0 0 0.028 1.415 0 0.108 0 1 1 0 0.991 1 1 0 0
        63 0 -0.0226 0.0686 1 0 0.009 0 1 0 0 0 1 0 1 0 1 1 0 0.002
        As you can see in the data table, there are two variables (a_62_r and a_62_q) which include "a_62_", which I also want to deal with in the same manner as the variables with "a_9_".
        Indeed, there are a lot of the same type of variables in my real data.
        To handle all these variables in the loop code, should I specify the full names of these variables using "inlist" with "|" as follows?

        Code:
        if inlist("`var'","a_9_f","a_9_y","a_9_s","a_9_t")|("`var'","a_62_r","a_62_q")
        Thanks a lot again,
        Last edited by Yuki Ishikawa; 18 Mar 2020, 23:02.

        Comment


        • #5
          Hi William,

          Thank you so much for your useful suggestion.
          As I posted in the previous reply to Andrew, I updated the data table, which is a part of my real data comprising 467 subjects with 1,139 variables.
          Using this data table, I ran the command you suggested with red-highlighted modifications.

          Code:
          ds a_15-dpb1_229
          local ivs `r(varlist)'
          macro list _ivs
          local ivs : subinstr local ivs `"a_9_f a_9_y a_9_s a_9_t"' `"a_9_*"'
          macro list _ivs
          local j=2
          foreach v of local ivs {
          logit pheno cov1 cov2 `v'
                          est sto E`j'
                          local ++j
          }
          Then I got the same error messages as happened in the codes Andrew suggested me.
          Code:
          note: a_9_t != 0 predicts success perfectly
                a_9_t dropped and 4 obs not used
          
          note: a_9_f != 0 predicts failure perfectly
                a_9_f dropped and 1 obs not used
          
          note: a_9_y != 2 predicts failure perfectly
                a_9_y dropped and 5 obs not used
          
          outcome = cov1 <= .0066 predicts data perfectly
          r(2000);
          Is this error due to my test data itself?

          As I mentioned in the previous post, there are two variables (a_62_r and a_62_q) which include "a_62_", which I also want to deal with in the same manner as the variables with "a_9_".
          Indeed, there are a lot of the same type of variables in my real data.
          Do just adding `"a_62_q a_62_r"' `"a_62_*"' right after local ivs : subinstr local ivs `"a_9_f a_9_y a_9_s a_9_t"' `"a_9_*"' work?
          Code:
          local ivs : subinstr local ivs `"a_9_f a_9_y a_9_s a_9_t"' `"a_9_*"' `"a_62_q a_62_r"' `"a_62_*"'
          I will really appreciate any of your comments or suggestions.

          Thanks again,

          Comment


          • #6
            // the code above generated the following error messages: note: a_9_t != 0 predicts success perfectly a_9_t dropped and 4 obs not used note: a_9_f != 0 predicts failure perfectly a_9_f dropped and 1 obs not used note: a_9_y != 2 predicts failure perfectly a_9_y dropped and 5 obs not used outcome = cov1 <= .0066 predicts data perfectly r(2000);
            These errors are not as a result of the code not working as intended. In logistic regression, you need variables that do not predict success or failure perfectly. For example, if the outcome is \(y=1\) if individual \(i\) earns an annual income of over 50,000, and I have a dummy equal to 1 if an individual lives in neighborhood X, then if all individuals who live in this neighborhood have an annual income of more than 50,000, the variable is not useful in helping me predict income>50,000. It predicts success perfectly. There is a lot of discussion on this issue in the forum.

            As you can see in the data table, there are two variables (a_62_r and a_62_q) which include "a_62_", which I also want to deal with in the same manner as the variables with "a_9_".
            Indeed, there are a lot of the same type of variables in my real data.
            To handle all these variables in the loop code, should I specify the full names of these variables using "inlist" with "|" as follows?
            Code:
            if inlist("`var'","a_9_f","a_9_y","a_9_s","a_9_t")|(" `var'","a_62_r","a_62_q"
            I did not mean literally that if the variables are more than 2, you need to include each of their names in full. Initially, you stated that you wanted to include a pair of variables, but in your code, you include more than 2. What is important is how the variables are named. So, if you want to include all variables that have a particular prefix, then there is no need to list all of them. With multiple variables with a common prefix, you can store the prefixes in a local macro.

            Code:
            *NOTE SPACES BETWEEN PREFIXES
            local pairs "a_9_ a_62_"
            
            local j=2
            foreach var of varlist a_15-dpb1_229{
                   if regexm("`pairs'", substr("`var'", 1, 4)){
                           local var= regexs(0)
                           logit pheno cov1 cov2 `var'*
                           est sto E`j'
                           local ++j
                           }
                   else{
                           logit pheno cov1 cov2 `var'
                           est sto E`j'
                           local ++j
                           }            
            }




            Comment


            • #7
              Do just adding `"a_62_q a_62_r"' `"a_62_*"' right after local ivs : subinstr local ivs `"a_9_f a_9_y a_9_s a_9_t"' `"a_9_*"' work?
              No. You need a separate command line for each substitution you need to make.

              Comment


              • #8
                Hi Andrew,

                Thank you so much again. I am learning a lot in this forum and definitely progressing to the right code.

                These errors are not as a result of the code not working as intended. In logistic regression, you need variables that do not predict success or failure perfectly. For example, if the outcome is y=1y=1 if individual ii earns an annual income of over 50,000, and I have a dummy equal to 1 if an individual lives in neighborhood X, then if all individuals who live in this neighborhood have an annual income of more than 50,000, the variable is not useful in helping me predict income&gt;50,000. It predicts success perfectly. There is a lot of discussion on this issue in the forum.
                Yes, this is a very very basic issue. Now I completely understood, and solved it just by increasing the number of subjects involved in the analysis (from 12 subjects to 49 subjects).

                I am sorry for lack of enough explanation about my data and goal of this analysis. Just in case I would like to describe them again:
                The data I attached in my previous reply is quite similar to my real data but with reduced data size, and thus reflect what my real data is like.
                The ultimate goal of this analysis is to figure out which position (i.e. a_#, dpb1_#, etc) is the most important for the observed phenotype. That's why I want to include variables of the same position, those with same prefix, in one regression model.

                Using the code you gave me in your last reply, I got four identical estimates for a_9_* variables and two identical estimates for a_62_* variables:

                I got four multiplicated following regression outcomes.
                Code:
                Logistic regression                             Number of obs     =         49
                                                                LR chi2(5)        =      15.81
                                                                Prob &gt; chi2       =     0.0074
                Log likelihood = -25.225827                     Pseudo R2         =     0.2386
                
                ------------------------------------------------------------------------------
                       pheno |      Coef.   Std. Err.      z    P&gt;|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                        cov1 |  -64.51361   27.89496    -2.31   0.0207    -119.1867   -9.840498
                        cov2 |  -47.50352   17.18793    -2.76   0.0057    -81.19126   -13.81579
                       a_9_f |  -.2385295   .8080603    -0.30   0.7679    -1.822299     1.34524
                       a_9_y |   .6032165   .6390824     0.94   0.3452     -.649362    1.855795
                       a_9_s |          0  (omitted)
                       a_9_t |   .7810256   .8636446     0.90   0.3658    -.9116867    2.473738
                       _cons |   -1.21602   .8711171    -1.40   0.1627    -2.923378    .4913386
                ------------------------------------------------------------------------------
                
                note: a_9_s omitted because of collinearity
                Iteration 0:   log likelihood =  -33.13297  
                Iteration 1:   log likelihood = -25.420908  
                Iteration 2:   log likelihood = -25.226902  
                Iteration 3:   log likelihood = -25.225827  
                Iteration 4:   log likelihood = -25.225827
                I got duplicated following regression outcomes.
                Code:
                Logistic regression                             Number of obs     =         49
                                                                LR chi2(4)        =      14.95
                                                                Prob &gt; chi2       =     0.0048
                Log likelihood = -25.657281                     Pseudo R2         =     0.2256
                
                ------------------------------------------------------------------------------
                       pheno |      Coef.   Std. Err.      z    P&gt;|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                        cov1 |  -63.22537   26.04379    -2.43   0.0152    -114.2703   -12.18048
                        cov2 |  -46.58052    17.7182    -2.63   0.0086    -81.30756   -11.85348
                      a_62_r |   .5115738   .7811008     0.65   0.5125    -1.019356    2.042503
                      a_62_q |  -.3145366   .5474842    -0.57   0.5656    -1.387586    .7585128
                       _cons |   -.662315   .5771591    -1.15   0.2512    -1.793526    .4688962
                ------------------------------------------------------------------------------
                
                Iteration 0:   log likelihood =  -33.13297  
                Iteration 1:   log likelihood = -25.727266  
                Iteration 2:   log likelihood = -25.657384  
                Iteration 3:   log likelihood = -25.657281  
                Iteration 4:   log likelihood = -25.657281
                As I only need one estimate for each group of variables (each position), I am wondering how I can avoid to get these multiplicated estimates.

                I will really appreciate any of the further suggestions or comments.

                Thanks a lot!!!

                Comment


                • #9
                  Hi William,

                  Thank you so much for your reply. As I mentioned in my last reply to Andrew, I am learning a lot from both of you.

                  Originally posted by William Lisowski View Post
                  No. You need a separate command line for each substitution you need to make.
                  I got it and this is what I expected.
                  I will try to make my code line by modifying yours and post it with a complete form.

                  Thanks again!!!

                  Comment


                  • #10
                    Using the code you gave me in your last reply, I got four identical estimates for a_9_* variables and two identical estimates for a_62_* variables:
                    I addressed this point in #2:

                    You can add exceptions using the if command. Because the loop considers one variable at a time, you will have two of the same regression, but the dimension of your final matrix does not change.
                    Your part #2.

                    As I only need one estimate for each group of variables (each position), I am wondering how I can avoid to get these multiplicated estimates.
                    You can replicate the local with prefixes and modify this as you progress.

                    Code:
                    *NOTE SPACES BETWEEN PREFIXES
                    local pairs "a_9_ a_62_"
                    local pairs2= "`pairs'"
                    
                    logit pheno cov1 cov2
                    est sto E1
                    
                    local j=2
                    foreach var of varlist a_15_-dpb1_229{
                            if regexm("`pairs2'", substr("`var'", 1, 4)){
                                    local var= regexs(0)
                                    logit pheno cov1 cov2 `var'*
                                    est sto E`j'
                                    local ++j
                                    local pairs2= ustrregexra("`pairs2'","`var'", "",1)
                             }
                             if !regexm("`pairs'", substr("`var'", 1, 4)){
                                    logit pheno cov1 cov2 `var'
                                    est sto E`j'
                                    local ++j
                              }            
                    }
                    matrix p = J(1130, 1, .)
                    forvalues i = 2/1130 {
                             lrtest E1 E`i', force
                             matrix p[`i', 1] = r(p)
                    }
                    matrix list p
                    2 points to note:

                    1. "if regexm("`pairs2'", substr("`var'", 1, 4)){" - this crucially relies on the first 4 characters of a variable name identifying the prefix. If this does not hold, you are in trouble.

                    2. The matrix dimensions now change as a result of your requirement.
                    Last edited by Andrew Musau; 19 Mar 2020, 13:37.

                    Comment


                    • #11
                      Hi Andrew,

                      Thank you so much for your kind and precise explanation.
                      Now the code above beautifully worked and all the estimates are stored as I intended!

                      I tried to figure out how this entire code work, but still could not understand especially why replicating the local with prefixes and adding the following can change the matrix dimensions as I desired.
                      Code:
                      local pairs2= ustrregexra("`pairs2'","`var'", "",1)
                      Actually I tried the program without replicating the local, and I got the following error.
                      Code:
                      a_9_ ambiguous abbreviation
                      r(111);
                      Of course, I have not still understood why I got the error above.

                      Could you please add further explanation about how differently the latter program work compared to that without replication of the local?

                      Many many thanks

                      Comment


                      • #12
                        Actually I tried the program without replicating the local, and I got the following error.
                        Code:
                        a_9_ ambiguous abbreviation r(111);
                        Of course, I have not still understood why I got the error above.
                        Can you show the exact command that you ran resulting in this error?

                        Comment


                        • #13
                          Hi Andrew,

                          Thank you so much for your reply.

                          This is what I just tried to figure out how your right code work properly.
                          Code:
                          . local pairs "a_9_ a_62_"
                          
                          . local j=2
                          
                          . foreach var of varlist a_15-dpb1_229{
                            2.         if regexm("`pairs'", substr("`var'", 1, 4)){
                            3.                 local var= regexs(0)
                            4.                 logit pheno cov1 cov2 `var'*
                            5.                 est sto E`j'
                            6.                 local ++j
                            7.                 local pairs= ustrregexra("`pairs'","`var'", "",.)
                            8.          }
                            9.          if !regexm("`pairs'", substr("`var'", 1, 4)){
                           10.                 logit pheno cov1 cov2 `var'
                           11.                 est sto E`j'
                           12.                 local ++j
                           13. }
                           14. }
                          Then I got the following error.
                          Code:
                          a_9_ ambiguous abbreviation
                          r(111);
                          Thanks a lot,

                          Comment


                          • #14
                            Thanks for the code. Here is the sequence:

                            1. There are four variables with the prefix "a_9_", i.e., a_9_f, a_9_y, a_9_s, a_9_t.
                            2. The first time we run the code and encounter one of these variables, e.g., a_9_f, the first if condition is fulfilled as the line below is true (evaluates to 1)

                            Code:
                            if regexm("`pairs'", substr("`var'", 1, 4)){
                            i.e., regexm("a_9_", substr("a_9_f", 1, 4))==1, see below

                            Code:
                            . di regexm("a_9_", substr("a_9_f", 1, 4))
                            1
                            3. regexs(0)= a_9_ and therefore, we have defined var=a_9_ below

                            Code:
                            local var= regexs(0)
                            4. This logit regression will run just fine and you will obtain the first estimate (substituting var=a_9_)

                            Code:
                            logit pheno cov1 cov2 a_9_*
                            5. Now, the following deletes a_9_ from the local containing prefexes as we want to run only one regression with this prefix

                            Code:
                            local pairs= ustrregexra("`pairs'","`var'", "",.)
                            Literally, we have

                            Code:
                            . di ustrregexra("a_9_ a_62","a_9_", "",.)
                             a_62
                            6. Stata has no memory, so it will skip to the next iteration of the loop. Now we no longer have the prefix a_9_ and assume that the next variable in order is a_9_y. The first if condition is not true, so Stata will proceed to the second. The regression here is

                            Code:
                            logit pheno cov1 cov2 `var'
                            But remember, we defined var= regexs(0)= a_9_ previously. So Stata will read this as

                            Code:
                            logit pheno cov1 cov2 a_9_
                            We still have 4 variables with the prefix a_9_ (a_9_f, a_9_y, a_9_s, a_9_t), so now we are confusing Stata. It will terminate by displaying the error

                            a_9_ ambiguous abbreviation
                            r(111);
                            So there you have it. My way of resolving this is to keep a local that contains all the prefixes so that the second if condition is never fulfilled if a variable has a prefix that is within this list of prefixes.








                            Comment


                            • #15
                              Hi Andrew,

                              Again, I really appreciate your kind and precise explanation. Now I completely understood how the error happened and your code worked properly.
                              It looked like a magic at first, but now it is a very logical way of handling the variables!

                              I cannot thank you enough!

                              Comment

                              Working...
                              X