How to include different numbers of variables in a loop function of regression

Yuki Ishikawa

Join Date: Mar 2019
Posts: 24

How to include different numbers of variables in a loop function of regression

17 Mar 2020, 23:35

I am running logistic regressions storing each of the estimates in order to run multiple likelihood-ration test and generate a matrix for p-value of each likelihood test.
Here is a part of my data:

ID	pheno	cov1	cov2	a_15	a_11	a_3	a_9_f	a_9_y	...	dpb1_229
11	0	0.0071	-0.0285	1	0	0	0	2	...	0
22	1	0.0066	-0.0018	1	0	0	0	2	...	0
23	0	0	0.0085	0	0	0	0.997	0.003	...	0
27	1	0.0165	0.0084	1	1	0	0.999	0.001	...	1
35	0	0.0164	0.0123	0	0	0	0	0	...	0

There are 1,129 columns from a_15 to dpb1_229 indicating different positions.

Using this data, I successfully could run following command for the purpose I mentioned above thanks to the advice I got in a different topic of this forum.

Code:

logit pheno cov1 cov2
est sto E1

local j=2
foreach var of varlist a_15_-dpb1_229{
logit pheno cov1 cov2 `var'
est sto E`j'
local ++j
}

matrix p = J(1130, 1, .)
forvalues i = 2/1130 {
 lrtest E1 E`i', force
 matrix p[`i', 1] = r(p)
}
matrix list p

However, I found that I had to include a_9_f and a_9_y in the same model (logit pheno cov1 cov2 a_9_f a_9_y) in stead of running separately because these two are data of the same position.
They have the same position names, but have additional alphabet characters coding different amino acids.
There are also several similar data in the 1,129 columns, which I also have to deal with in the same manner.
So my question is if there is a way to modify the command above to accomplish this task.

I can also provide additional information about my data if necessary.

I guess this might be a complicated task, but I will really appreciate any suggestions or comments.

Thanks in advance.

Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10200

18 Mar 2020, 04:33

You can add exceptions using the if command. Because the loop considers one variable at a time, you will have two of the same regression, but the dimension of your final matrix does not change. Below, I assume that you have only 2 variables which include "a_9_" as part of their variable name. If not, use the -inlist()- function, specifying the full variable names.

Code:

logit pheno cov1 cov2
est sto E1

local j=2
foreach var of varlist a_15_-dpb1_229{
        if regexm("`var'", "a_9_"){
                  logit pheno cov1 cov2 a_9_*
                  est sto E`j'
                  local ++j
         }
         else{
                logit pheno cov1 cov2 `var'
                est sto E`j'
                local ++j
          }            
}
matrix p = J(1130, 1, .)
forvalues i = 2/1130 {
         lrtest E1 E`i', force
         matrix p[`i', 1] = r(p)
}
matrix list p

Last edited by Andrew Musau; 18 Mar 2020, 04:35.

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

18 Mar 2020, 09:33

Here is another approach that may work within your existing code. As written, it depends on paired variables appearing next to each other in your dataset.

Code:

ds a_15-dpb1_229
local ivs `r(varlist)'
macro list _ivs
local ivs : subinstr local ivs `"a_9_f a_9_y"' `"a_9_*"'
macro list _ivs
foreach v of local ivs {
    display _newline "===== summarizing `v' ====="
    summarize `v'
}

Code:

. ds a_15-dpb1_229
a_15      a_11      a_3       a_9_f     a_9_y     dpb1_229

. local ivs `r(varlist)'

. macro list _ivs
_ivs:           a_15 a_11 a_3 a_9_f a_9_y dpb1_229

. local ivs : subinstr local ivs `"a_9_f a_9_y"' `"a_9_*"'

. macro list _ivs
_ivs:           a_15 a_11 a_3 a_9_* dpb1_229

. foreach v of local ivs {
  2.         display _newline "===== summarizing `v' ====="
  3.         summarize `v'
  4. }

===== summarizing a_15 =====

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        a_15 |         10   -.1131776     1.05476  -1.836232   1.853389

===== summarizing a_11 =====

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        a_11 |         10    .1351789    1.405627  -2.228199    2.12187

===== summarizing a_3 =====

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         a_3 |         10    .6596058    .9512471  -1.042256    2.16461

===== summarizing a_9_* =====

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       a_9_f |         10   -.5653816    1.222993  -2.268521   2.109213
       a_9_y |         10   -.4161206     .974415  -2.028768   1.061687

===== summarizing dpb1_229 =====

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    dpb1_229 |         10    .0836618    1.102003  -1.651658   2.251146

Comment

Yuki Ishikawa

Join Date: Mar 2019
Posts: 24

18 Mar 2020, 22:56

Hi Andrew,

Thank you so much for your quick reply.
As there are four variables that include "a_9_" as part of their variable name, I replaced "regexm" with "inlist" as you suggested.
However, the loop function stopped with generating error messages as seen in the very last part of the following codes. What wrong with my codes?

Code:

. logit ssb75 cov1 cov2
. est sto E1

. local j=2
. foreach var of varlist a_15-dpb1_229{
  2. if inlist("`var'","a_9_f","a_9_y","a_9_s","a_9_t"){
  3. logit pheno cov1 cov2 a_9_*
  4. est sto E`j'
  5. local ++j
  6. }
  7. else{
  8. logit pheno cov1 cov2 `var'
  9. est sto E`j'
 10. local ++j
 11. }
 12. }

// the code above generated the following error messages:
note: a_9_t != 0 predicts success perfectly
      a_9_t dropped and 4 obs not used

note: a_9_f != 0 predicts failure perfectly
      a_9_f dropped and 1 obs not used

note: a_9_y != 2 predicts failure perfectly
      a_9_y dropped and 5 obs not used

outcome = cov1 <= .0066 predicts data perfectly
r(2000);

I also attached an updated 'test' file, which is only a part of my real data (467 subjects with 1,139 variables), so that you can have a look at it (I am sorry for a busy table):

id	pheno	cov1	cov2	a_15	a_11	a_3	a_9_f	a_9_y	a_9_s	a_9_t	a_44	a_56	a_62_r	a_62_q	a_63	a_65	a_66	a_67	dpb1_229
11	0	0.0071	-0.0285	1	0	0	0	2	0	0	0	0	1	1	0.997	0	2	0	0.015
22	1	0.0066	-0.0018	1	0	0	0	2	0	0	0	0	1	1	0.997	0	2	0	0
23	0	0	0.0085	0	0	0	0.997	0.003	0	0	1	1	0	0	0	1	0	0	0
27	1	0.0165	0.0084	1	1	0	0.999	0.001	0	1	1	0	0	1	0	0	1	0	0.014
35	0	0.0164	0.0123	0	0	0	0	0	0	0	0	2	0	0	0	2	0	0	0
36	0	0.013	0.0134	0	0	0	0	1	0	0	1	1	0	0	0	1	0	0	0
39	0	-0.0021	-0.0101	1	0	0.013	0	1	0	0	0	1	0	1	0	1	1	0	0.002
49	1	0.0081	0.0171	0	0	0.02	0.027	0.434	0	0.112	0	1	0.551	0.638	0.924	1	1	0	0.014
55	1	0.0015	-0.0151	0	0	0	0.028	1.415	0	0.108	1	1	0	0	0	1	0	0	0
56	0	-0.007	0.0155	1	0	0	0	1	0	0	0	1	0	1	0	1	1	0	0
59	1	-0.0147	0.0139	0	0	0	0.028	1.415	0	0.108	0	1	1	0	0.991	1	1	0	0
63	0	-0.0226	0.0686	1	0	0.009	0	1	0	0	0	1	0	1	0	1	1	0	0.002

As you can see in the data table, there are two variables (a_62_r and a_62_q) which include "a_62_", which I also want to deal with in the same manner as the variables with "a_9_".
Indeed, there are a lot of the same type of variables in my real data.
To handle all these variables in the loop code, should I specify the full names of these variables using "inlist" with "|" as follows?

Code:

if inlist("`var'","a_9_f","a_9_y","a_9_s","a_9_t")|("`var'","a_62_r","a_62_q")

Thanks a lot again,

Last edited by Yuki Ishikawa; 18 Mar 2020, 23:02.

Comment

Yuki Ishikawa

Join Date: Mar 2019

Posts: 24
#5

18 Mar 2020, 23:31

Hi William,

Thank you so much for your useful suggestion.
As I posted in the previous reply to Andrew, I updated the data table, which is a part of my real data comprising 467 subjects with 1,139 variables.
Using this data table, I ran the command you suggested with red-highlighted modifications.

Code:

ds a_15-dpb1_229 local ivs `r(varlist)' macro list _ivs local ivs : subinstr local ivs `"a_9_f a_9_y a_9_s a_9_t"' `"a_9_*"' macro list _ivs local j=2 foreach v of local ivs { logit pheno cov1 cov2 `v' est sto E`j' local ++j }

Then I got the same error messages as happened in the codes Andrew suggested me.

Code:

note: a_9_t != 0 predicts success perfectly a_9_t dropped and 4 obs not used note: a_9_f != 0 predicts failure perfectly a_9_f dropped and 1 obs not used note: a_9_y != 2 predicts failure perfectly a_9_y dropped and 5 obs not used outcome = cov1 <= .0066 predicts data perfectly r(2000);

Is this error due to my test data itself?

As I mentioned in the previous post, there are two variables (a_62_r and a_62_q) which include "a_62_", which I also want to deal with in the same manner as the variables with "a_9_".
Indeed, there are a lot of the same type of variables in my real data.
Do just adding `"a_62_q a_62_r"' `"a_62_*"' right after local ivs : subinstr local ivs `"a_9_f a_9_y a_9_s a_9_t"' `"a_9_*"' work?

Code:

local ivs : subinstr local ivs `"a_9_f a_9_y a_9_s a_9_t"' `"a_9_*"' `"a_62_q a_62_r"' `"a_62_*"'

I will really appreciate any of your comments or suggestions.

Thanks again,
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10200
#6

19 Mar 2020, 06:30

// the code above generated the following error messages: note: a_9_t != 0 predicts success perfectly a_9_t dropped and 4 obs not used note: a_9_f != 0 predicts failure perfectly a_9_f dropped and 1 obs not used note: a_9_y != 2 predicts failure perfectly a_9_y dropped and 5 obs not used outcome = cov1 <= .0066 predicts data perfectly r(2000);

These errors are not as a result of the code not working as intended. In logistic regression, you need variables that do not predict success or failure perfectly. For example, if the outcome is \(y=1\) if individual \(i\) earns an annual income of over 50,000, and I have a dummy equal to 1 if an individual lives in neighborhood X, then if all individuals who live in this neighborhood have an annual income of more than 50,000, the variable is not useful in helping me predict income>50,000. It predicts success perfectly. There is a lot of discussion on this issue in the forum.

As you can see in the data table, there are two variables (a_62_r and a_62_q) which include "a_62_", which I also want to deal with in the same manner as the variables with "a_9_".
Indeed, there are a lot of the same type of variables in my real data.
To handle all these variables in the loop code, should I specify the full names of these variables using "inlist" with "|" as follows?
Code:
if inlist("`var'","a_9_f","a_9_y","a_9_s","a_9_t")|(" `var'","a_62_r","a_62_q"

I did not mean literally that if the variables are more than 2, you need to include each of their names in full. Initially, you stated that you wanted to include a pair of variables, but in your code, you include more than 2. What is important is how the variables are named. So, if you want to include all variables that have a particular prefix, then there is no need to list all of them. With multiple variables with a common prefix, you can store the prefixes in a local macro.

Code:

*NOTE SPACES BETWEEN PREFIXES local pairs "a_9_ a_62_" local j=2 foreach var of varlist a_15-dpb1_229{ if regexm("`pairs'", substr("`var'", 1, 4)){ local var= regexs(0) logit pheno cov1 cov2 `var'* est sto E`j' local ++j } else{ logit pheno cov1 cov2 `var' est sto E`j' local ++j } }
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

19 Mar 2020, 07:36

Do just adding `"a_62_q a_62_r"' `"a_62_*"' right after local ivs : subinstr local ivs `"a_9_f a_9_y a_9_s a_9_t"' `"a_9_*"' work?

No. You need a separate command line for each substitution you need to make.
Comment

Yuki Ishikawa

Join Date: Mar 2019
Posts: 24

19 Mar 2020, 10:20

Hi Andrew,

Thank you so much again. I am learning a lot in this forum and definitely progressing to the right code.

These errors are not as a result of the code not working as intended. In logistic regression, you need variables that do not predict success or failure perfectly. For example, if the outcome is y=1y=1 if individual ii earns an annual income of over 50,000, and I have a dummy equal to 1 if an individual lives in neighborhood X, then if all individuals who live in this neighborhood have an annual income of more than 50,000, the variable is not useful in helping me predict income>50,000. It predicts success perfectly. There is a lot of discussion on this issue in the forum.

Yes, this is a very very basic issue. Now I completely understood, and solved it just by increasing the number of subjects involved in the analysis (from 12 subjects to 49 subjects).

I am sorry for lack of enough explanation about my data and goal of this analysis. Just in case I would like to describe them again:
The data I attached in my previous reply is quite similar to my real data but with reduced data size, and thus reflect what my real data is like.
The ultimate goal of this analysis is to figure out which position (i.e. a_#, dpb1_#, etc) is the most important for the observed phenotype. That's why I want to include variables of the same position, those with same prefix, in one regression model.

Using the code you gave me in your last reply, I got four identical estimates for a_9_* variables and two identical estimates for a_62_* variables:

I got four multiplicated following regression outcomes.

Code:

Logistic regression                             Number of obs     =         49
                                                LR chi2(5)        =      15.81
                                                Prob &gt; chi2       =     0.0074
Log likelihood = -25.225827                     Pseudo R2         =     0.2386

------------------------------------------------------------------------------
       pheno |      Coef.   Std. Err.      z    P&gt;|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        cov1 |  -64.51361   27.89496    -2.31   0.0207    -119.1867   -9.840498
        cov2 |  -47.50352   17.18793    -2.76   0.0057    -81.19126   -13.81579
       a_9_f |  -.2385295   .8080603    -0.30   0.7679    -1.822299     1.34524
       a_9_y |   .6032165   .6390824     0.94   0.3452     -.649362    1.855795
       a_9_s |          0  (omitted)
       a_9_t |   .7810256   .8636446     0.90   0.3658    -.9116867    2.473738
       _cons |   -1.21602   .8711171    -1.40   0.1627    -2.923378    .4913386
------------------------------------------------------------------------------

note: a_9_s omitted because of collinearity
Iteration 0:   log likelihood =  -33.13297  
Iteration 1:   log likelihood = -25.420908  
Iteration 2:   log likelihood = -25.226902  
Iteration 3:   log likelihood = -25.225827  
Iteration 4:   log likelihood = -25.225827

I got duplicated following regression outcomes.

Code:

Logistic regression                             Number of obs     =         49
                                                LR chi2(4)        =      14.95
                                                Prob &gt; chi2       =     0.0048
Log likelihood = -25.657281                     Pseudo R2         =     0.2256

------------------------------------------------------------------------------
       pheno |      Coef.   Std. Err.      z    P&gt;|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        cov1 |  -63.22537   26.04379    -2.43   0.0152    -114.2703   -12.18048
        cov2 |  -46.58052    17.7182    -2.63   0.0086    -81.30756   -11.85348
      a_62_r |   .5115738   .7811008     0.65   0.5125    -1.019356    2.042503
      a_62_q |  -.3145366   .5474842    -0.57   0.5656    -1.387586    .7585128
       _cons |   -.662315   .5771591    -1.15   0.2512    -1.793526    .4688962
------------------------------------------------------------------------------

Iteration 0:   log likelihood =  -33.13297  
Iteration 1:   log likelihood = -25.727266  
Iteration 2:   log likelihood = -25.657384  
Iteration 3:   log likelihood = -25.657281  
Iteration 4:   log likelihood = -25.657281

As I only need one estimate for each group of variables (each position), I am wondering how I can avoid to get these multiplicated estimates.

I will really appreciate any of the further suggestions or comments.

Thanks a lot!!!

Comment

Yuki Ishikawa

Join Date: Mar 2019

Posts: 24
#9

19 Mar 2020, 10:33

Hi William,

Thank you so much for your reply. As I mentioned in my last reply to Andrew, I am learning a lot from both of you.

Originally posted by William Lisowski View Post

No. You need a separate command line for each substitution you need to make.

I got it and this is what I expected.
I will try to make my code line by modifying yours and post it with a complete form.

Thanks again!!!
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10200
#10

19 Mar 2020, 13:30

Using the code you gave me in your last reply, I got four identical estimates for a_9_* variables and two identical estimates for a_62_* variables:

I addressed this point in #2:

You can add exceptions using the if command. Because the loop considers one variable at a time, you will have two of the same regression, but the dimension of your final matrix does not change.

Your part #2.

As I only need one estimate for each group of variables (each position), I am wondering how I can avoid to get these multiplicated estimates.

You can replicate the local with prefixes and modify this as you progress.

Code:

*NOTE SPACES BETWEEN PREFIXES local pairs "a_9_ a_62_" local pairs2= "`pairs'" logit pheno cov1 cov2 est sto E1 local j=2 foreach var of varlist a_15_-dpb1_229{ if regexm("`pairs2'", substr("`var'", 1, 4)){ local var= regexs(0) logit pheno cov1 cov2 `var'* est sto E`j' local ++j local pairs2= ustrregexra("`pairs2'","`var'", "",1) } if !regexm("`pairs'", substr("`var'", 1, 4)){ logit pheno cov1 cov2 `var' est sto E`j' local ++j } } matrix p = J(1130, 1, .) forvalues i = 2/1130 { lrtest E1 E`i', force matrix p[`i', 1] = r(p) } matrix list p

2 points to note:

1. "if regexm("`pairs2'", substr("`var'", 1, 4)){" - this crucially relies on the first 4 characters of a variable name identifying the prefix. If this does not hold, you are in trouble.

2. The matrix dimensions now change as a result of your requirement.

Last edited by Andrew Musau; 19 Mar 2020, 13:37.
Comment
Yuki Ishikawa

Join Date: Mar 2019

Posts: 24
#11

19 Mar 2020, 20:42

Hi Andrew,

Thank you so much for your kind and precise explanation.
Now the code above beautifully worked and all the estimates are stored as I intended!

I tried to figure out how this entire code work, but still could not understand especially why replicating the local with prefixes and adding the following can change the matrix dimensions as I desired.

Code:

local pairs2= ustrregexra("`pairs2'","`var'", "",1)

Actually I tried the program without replicating the local, and I got the following error.

Code:

a_9_ ambiguous abbreviation r(111);

Of course, I have not still understood why I got the error above.

Could you please add further explanation about how differently the latter program work compared to that without replication of the local?

Many many thanks
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10200
#12

20 Mar 2020, 02:23

Actually I tried the program without replicating the local, and I got the following error.
Code:
a_9_ ambiguous abbreviation r(111);
Of course, I have not still understood why I got the error above.

Can you show the exact command that you ran resulting in this error?
Comment

Yuki Ishikawa

Join Date: Mar 2019
Posts: 24

#13

20 Mar 2020, 06:39

Hi Andrew,

Thank you so much for your reply.

This is what I just tried to figure out how your right code work properly.

Code:

. local pairs "a_9_ a_62_"

. local j=2

. foreach var of varlist a_15-dpb1_229{
  2.         if regexm("`pairs'", substr("`var'", 1, 4)){
  3.                 local var= regexs(0)
  4.                 logit pheno cov1 cov2 `var'*
  5.                 est sto E`j'
  6.                 local ++j
  7.                 local pairs= ustrregexra("`pairs'","`var'", "",.)
  8.          }
  9.          if !regexm("`pairs'", substr("`var'", 1, 4)){
 10.                 logit pheno cov1 cov2 `var'
 11.                 est sto E`j'
 12.                 local ++j
 13. }
 14. }

Then I got the following error.

Code:

a_9_ ambiguous abbreviation
r(111);

Thanks a lot,

Comment

Andrew Musau

Join Date: Oct 2014

Posts: 10200
#14

20 Mar 2020, 10:09

Thanks for the code. Here is the sequence:

1. There are four variables with the prefix "a_9_", i.e., a_9_f, a_9_y, a_9_s, a_9_t.
2. The first time we run the code and encounter one of these variables, e.g., a_9_f, the first if condition is fulfilled as the line below is true (evaluates to 1)

Code:

if regexm("`pairs'", substr("`var'", 1, 4)){

i.e., regexm("a_9_", substr("a_9_f", 1, 4))==1, see below

Code:

. di regexm("a_9_", substr("a_9_f", 1, 4)) 1

3. regexs(0)= a_9_ and therefore, we have defined var=a_9_ below

Code:

local var= regexs(0)

4. This logit regression will run just fine and you will obtain the first estimate (substituting var=a_9_)

Code:

logit pheno cov1 cov2 a_9_*

5. Now, the following deletes a_9_ from the local containing prefexes as we want to run only one regression with this prefix

Code:

local pairs= ustrregexra("`pairs'","`var'", "",.)

Literally, we have

Code:

. di ustrregexra("a_9_ a_62","a_9_", "",.) a_62

6. Stata has no memory, so it will skip to the next iteration of the loop. Now we no longer have the prefix a_9_ and assume that the next variable in order is a_9_y. The first if condition is not true, so Stata will proceed to the second. The regression here is

Code:

logit pheno cov1 cov2 `var'

But remember, we defined var= regexs(0)= a_9_ previously. So Stata will read this as

Code:

logit pheno cov1 cov2 a_9_

We still have 4 variables with the prefix a_9_ (a_9_f, a_9_y, a_9_s, a_9_t), so now we are confusing Stata. It will terminate by displaying the error

a_9_ ambiguous abbreviation
r(111);

So there you have it. My way of resolving this is to keep a local that contains all the prefixes so that the second if condition is never fulfilled if a variable has a prefix that is within this list of prefixes.
1 like
Comment
Yuki Ishikawa

Join Date: Mar 2019

Posts: 24
#15

20 Mar 2020, 19:47

Hi Andrew,

Again, I really appreciate your kind and precise explanation. Now I completely understood how the error happened and your code worked properly.
It looked like a magic at first, but now it is a very logical way of handling the variables!

I cannot thank you enough!
Comment

Announcement

How to include different numbers of variables in a loop function of regression

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment