How to include different numbers of variables in a loop function of regression

Yuki Ishikawa

Join Date: Mar 2019

Posts: 24
#16

21 Mar 2020, 15:53

I still have a question.

As the table in my previous reply shows, some variables, such as dpb1_229, have longer prefixes than others, such as a_9_f or a_62_g.
In the code Andrew created, the length of prefixes is "4".

Code:

if regexm("`pairs'", substr("`var'", 1, 4)){

In my real data, several variables share the same prefix, such as dpb1_170_t/dpb1_170_y or dpb1_178_s/dpb1_178_x/dpb1_178_s, so I listed these prefixes in the initial local macro "pairs" as follows.

Code:

local pairs "a_9_ a_62_ a_76_ a_95_ a_97_ a_99_ a_114_ a_116_ a_152_ a_156_ c_275_ c_156_ c_152_ c_116_ ///
    c_99_ c_95_ c_9_ b_325_ b_305_ b_282_ b_163_ b_156_ b_116_ b_114_ b_99_ b_97_ b_95_ b_80_ b_77_ b_70_ b_69_ ///
    b_67_ b_66_ b_45_ b_24_ b_z8_ b_z10_ b_z16_ b_z21_ b_z23_ drb1_233_ drb1_231_ drb1_189_ drb1_181_ drb1_180_ ///
    drb1_166_ drb1_149_ drb1_142_ drb1_140_ drb1_133_ drb1_120_ drb1_112_ drb1_104_ drb1_98_ drb1_96_ drb1_74_ ///
    drb1_71_ drb1_70_ drb1_67_ drb1_60_ drb1_57_ drb1_38_ drb1_37_ drb1_31_ drb1_30_ drb1_28_ drb1_26_ drb1_16_ ///
    drb1_13_ drb1_11_ drb1_10_ drb1_9_ drb1_4_ drb1_z1_ drb1_z16_ drb1_z17_ drb1_z24_ drb1_z25_ dqb1_224_ dqb1_221_ ///
    dqb1_220_ dqb1_203_ dqb1_197_ dqb1_185_ dqb1_182_ dqb1_167_ dqb1_140_ dqb1_130_ dqb1_126_ dqb1_125_ ///
    dqb1_116_ dqb1_87_ dqb1_86_ dqb1_74_ dqb1_71_ dqb1_70_ dqb1_57_ dqb1_55_ dqb1_37_ dqb1_30_ dqb1_26_ ///
    dqb1_9_ dqb1_3_ dqb1_z4_ dqb1_z5_ dqb1_z6_ dqb1_z9_ dqb1_z10_ dqb1_z17_ dqb1_z18_ dqb1_z21_ dqb1_z27_ ///
    dpb1_9_ dpb1_35_ dpb1_55_ dpb1_65_ dpb1_76_ dpb1_96_ dpb1_170_ dpb1_178_ dpb1_205_"

When I ran the program with the length of prefixes 4, I found that Stata could not distinguish some prefixes, such as a_114_ and a_116_, and thus I changed the length of prefixes from 4 to 5.
However, Stata still could not distinguish longer prefixes, such as dpb1_170_ and dpb1_178_, and the resulting regression included many variables with the prefixes dqb1_, drb1_, and dpb1_.
On the other hand, if I changed the length of prefixes from 4 to 8, Stata could not properly handle variables with shorter prefixes, such as a_9_ or a_62_; variables with these prefixes were separately included in regressions.
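
A minimal sketch of why no fixed prefix length can work here. It uses only two prefixes from the list above, and a_114_x is a hypothetical variable name chosen for illustration. The first four characters are the same for both prefixes, so the match cannot tell them apart, while a length of 8 keeps the trailing letter of a short name:

```stata
* Sketch only: a_114_x is a hypothetical variable name used for illustration
local pairs "a_114_ a_116_"
display substr("a_114_x", 1, 4)                     // a_11
display regexm("`pairs'", substr("a_114_x", 1, 4))  // 1: "a_11" occurs in both prefixes
display substr("a_9_f", 1, 8)                       // a_9_f: the name is shorter than 8 characters
```

Because the last call returns the whole name, it never equals the listed prefix a_9_, which is consistent with those variables being entered separately.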

So, my question is whether there is a way to make Stata handle all the prefixes listed above properly.

Any comments and suggestions will be highly appreciated.

Last edited by Yuki Ishikawa; 21 Mar 2020, 16:00.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10203
#17

21 Mar 2020, 23:13

If all your variable names end in a single letter, e.g., dpb1_178_a or dpb1_170_q and not something like a_62_zz or a_114_abc, then, in the code in #10, replace the line

Code:

if regexm("`pairs'", substr("`var'", 1, 4)){

with

Code:

if regexm("`pairs'", substr("`var'", 1, length("`var'")-1)){

Otherwise, show all possible combinations of your variable names.
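
As a rough illustration of the replacement line, the sketch below strips the last character from two of the names mentioned in this thread and checks the result against a shortened prefix list. The local names `name` and `stem` are introduced here for illustration only:

```stata
* Sketch only: names are taken from the examples in this thread
local pairs "dpb1_170_ dpb1_178_"
foreach name in dpb1_170_t dpb1_178_x {
    * Drop the final character, leaving the prefix up to and including the underscore
    local stem = substr("`name'", 1, length("`name'") - 1)
    display "`name' -> `stem' (found in pairs: " regexm("`pairs'", "`stem'") ")"
}
```

Each name should map to its own full prefix, so dpb1_170_ and dpb1_178_ are no longer confused, provided every name really ends in exactly one letter.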
Comment

Yuki Ishikawa

Join Date: Mar 2019
Posts: 24

#18

22 Mar 2020, 22:51

Hi Andrew,

Thank you very much for your reply.

All of my variables end either with a single letter or with no letter (e.g., dpb1_205_m or dpb1_229_), so I thought your replacement line would definitely work.
Below is my current complete command for my real data:

Code:

local pairs "a_9_ a_62_ a_76_ a_95_ a_97_ a_99_ a_114_ a_116_ a_152_ a_156_ c_275_ c_156_ c_152_ c_116_ c_99_ c_95_ c_9_ b_325_ b_305_ b_282_ b_163_ b_156_ b_116_ b_114_ b_99_ b_97_ b_95_ b_80_ b_77_ b_70_ b_69_ b_67_ b_66_ b_45_ b_24_ b_z8_ b_z10_ b_z11_ b_z16_ b_z21_ b_z23_ drb1_233_ drb1_231_ drb1_189_ drb1_181_ drb1_180_ drb1_166_ drb1_149_ drb1_142_ drb1_140_ drb1_133_ drb1_120_ drb1_112_ drb1_104_ drb1_98_ drb1_96_ drb1_74_ drb1_71_ drb1_70_ drb1_67_ drb1_60_ drb1_57_ drb1_38_ drb1_37_ drb1_31_ drb1_30_ drb1_28_ drb1_26_ drb1_16_ drb1_13_ drb1_11_ drb1_10_ drb1_9_ drb1_4_ drb1_z1_ drb1_z16_ drb1_z17_ drb1_z24_ drb1_z25_ dqb1_224_ dqb1_221_ dqb1_220_ dqb1_203_ dqb1_197_ dqb1_185_ dqb1_182_ dqb1_167_ dqb1_140_ dqb1_130_ dqb1_126_ dqb1_125_ dqb1_116_ dqb1_87_ dqb1_86_ dqb1_74_ dqb1_71_ dqb1_70_ dqb1_57_ dqb1_55_ dqb1_37_ dqb1_30_ dqb1_26_ dqb1_9_ dqb1_3_ dqb1_z4_ dqb1_z5_ dqb1_z6_ dqb1_z9_ dqb1_z10_ dqb1_z17_ dqb1_z18_ dqb1_z21_ dqb1_z27_ dpb1_9_ dpb1_35_ dpb1_55_ dpb1_65_ dpb1_76_ dpb1_96_ dpb1_170_ dpb1_178_ dpb1_205_"
local pairs2= "`pairs'"

logit ssb75 cov1 cov2 cov3 cov4 cov5 cov6 cov7 cov8 cov9 cov10
est sto E1

local j=2
foreach var of varlist a_z15_-dpb1_229_{
    * Name minus its last character is found in pairs2: fit one model for the group
    if regexm("`pairs2'", substr("`var'", 1, length("`var'")-1)){
        local var= regexs(0)
        capture noisily logit ssb75 cov1 cov2 cov3 cov4 cov5 cov6 cov7 cov8 cov9 cov10 `var'*
        est sto E`j'
        local ++j
        * Remove the matched text so the same group is not fitted again
        local pairs2= ustrregexra("`pairs2'","`var'", "",1)
    }
    * Name is not found in the original list: fit the variable on its own
    if !regexm("`pairs'", substr("`var'", 1, length("`var'")-1)){
        capture noisily logit ssb75 cov1 cov2 cov3 cov4 cov5 cov6 cov7 cov8 cov9 cov10 `var'
        est sto E`j'
        local ++j
    }
}

matrix p = J(1130, 1, .) 
forvalues i = 2/1130 {
         lrtest E1 E`i', force
         matrix p[`i', 1] = r(p)
}

matrix list p

However, as you can see in tables below, I still could not teach Stata to handle variables as I desired.

Code:

                        
ssb75          Coef.  Std. Err.       z     P>z  [95% Conf.  Interval]
cov1       -25.65584   14.73288   -1.74   0.082   -54.53174   3.220067
 …
cov10       28.54345   16.41228    1.74   0.082   -3.624021   60.71093
b_z1_              0  (omitted)
b_z10_a     594.2929   426.5589    1.39   0.164   -241.7472   1430.333
b_z10_g    -593.0263   426.6178   -1.39   0.165   -1429.182   243.1291
b_z11_s     36.94893   183.1397    0.20   0.840   -321.9983   395.8961
b_z11_w     -37.3035   183.2688   -0.20   0.839   -396.5038   321.8968
b_z12_      371.7737   267.4475    1.39   0.165   -152.4138   895.9611
b_z13_      -812.672   684.5753   -1.19   0.235   -2154.415   529.0709
b_z14_     -326.9258   281.9817   -1.16   0.246   -879.5997   225.7482
b_z15_      25.65838   60.33897    0.43   0.671   -92.60383   143.9206
b_z16_v    -47.06698   145.4012   -0.32   0.746   -332.0482   237.9142
b_z16_l     46.58823   145.4456    0.32   0.749   -238.4799   331.6563
b_z17_             0  (omitted)
b_z18_             0  (omitted)
b_z19_             0  (omitted)
_cons      -1.020381   .8437754   -1.21   0.227    -2.67415   .6333888
// "z" means minus(-)

Code:

ssb75          Coef.  Std. Err.       z     P>z  [95% Conf.  Interval]
cov1       -27.73629   13.75522   -2.02   0.044   -54.69602  -.7765647
 …
cov10       10.94523   13.41263    0.82   0.414   -15.34304    37.2335
b_z2_              0  (omitted)
b_z20_             0  (omitted)
b_z21_m     5674.548   945.6718    6.00   0.000    3821.065   7528.031
b_z21_t     -7000.77          .       .       .           .          .
b_z22_     -4889.674   3922.945   -1.25   0.213    -12578.5   2799.157
b_z23_l            0  (omitted)
b_z23_r      1326.22   945.6763    1.40   0.161   -527.2714   3179.712
b_z24_             0  (omitted)
_cons      -1.222578   .3103971   -3.94   0.000   -1.830945  -.6142108
// "z" means minus(-)

I am wondering whether there is a way to modify the command above further.

Any comments and suggestions will be appreciated.

Thanks a lot,

Last edited by Yuki Ishikawa; 22 Mar 2020, 23:20.

Comment

Andrew Musau

Join Date: Oct 2014

Posts: 10203
#19

23 Mar 2020, 09:46

As I stated in #17, the following

Code:

if regexm("`pairs'", substr("`var'", 1, length("`var'")-1)){

will work provided that a prefix is identified by the full variable name excluding the last character (letter). I do not learn much from your response in #18. What I need is for you to show me exceptions to this rule. For example, you should list a group of variables and state that these variables should be grouped together, even though a couple of them do not follow the rule stated above.
Comment
Yuki Ishikawa

Join Date: Mar 2019

Posts: 24
#20

23 Mar 2020, 21:36

Hi Andrew,

Thank you so much for your reply and I am sorry for lack of information in #18.

All of my variables end either in a single letter or in no letter at all (just the underscore), with no exceptions.
Since Stata does not accept "-" (minus) in variable names, I replaced "-" with "z" (e.g., z21 means -21 and z23 means -23), but I do not think this matters.
For my analysis, variables with the same prefix number should be grouped as follows:
a_z1_, (a_z10_a, a_z10_g), (a_z11_g, a_z11_w), a_z12_, a_z13_, a_z14_, a_z15_, (a_z16_v, a_z16_y), a_z17_, a_z18_, a_z19_
However, as you know, these variables were all grouped together, and the same thing happened in the second regression table in #18.
This happens when the variables appear in order starting from the smallest absolute numbers.
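
A minimal sketch of what appears to go wrong for names that end in an underscore, using b_z1_ from the first table in #18 and a shortened prefix list. Stripping the last character removes the underscore itself, the shorter stem is then found inside a longer listed prefix, and the wildcard built from it spans the whole range:

```stata
* Sketch only: a shortened prefix list and the name b_z1_ from the first table in #18
local pairs2 "b_z8_ b_z10_ b_z11_ b_z16_"
local var "b_z1_"
local stem = substr("`var'", 1, length("`var'") - 1)
display "`stem'"                      // b_z1 : the trailing underscore is stripped
if regexm("`pairs2'", "`stem'") {
    display regexs(0) "*"             // b_z1* : the varlist passed to logit
}
```

The pattern b_z1* covers b_z1_ as well as b_z10_a through b_z19_, which matches the variables listed in that first table. The same reasoning applied to b_z2_ gives b_z2*, matching the second table.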

The initial letters indicate a single gene, and the prefix numbers indicate genetic locations.
I have kept only the locations of interest, so the prefix numbers are not consecutive.
Since each gene has a different orientation, the order in which the prefix numbers appear is not always the same.

I have also attached one of my datasets so that you can have a look at it.

I hope my reply is satisfactory.

Thanks a lot,
Attached Files

Statalist.xlsx (1.88 MB, 1 view)
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10203
#21

24 Mar 2020, 00:15

Code:

rename (*_) (*_R)

Run the code, then

Code:

rename (*_R) (*_)
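
A small sketch of why the temporary suffix helps, on a toy dataset with three names from the first table in #18. It assumes that no variable in the real data already ends in _R and that the longer names stay within Stata's 32-character limit:

```stata
* Sketch only: toy data with three variable names from the first table in #18
clear
set obs 1
generate b_z1_   = 1
generate b_z10_a = 1
generate b_z10_g = 1

rename (*_) (*_R)          // b_z1_ becomes b_z1_R; names ending in a letter are untouched
ds

* Every name now ends in one letter, so dropping it keeps the underscore
local stem = substr("b_z1_R", 1, length("b_z1_R") - 1)
display "`stem'"                      // b_z1_
display regexm("b_z10_", "`stem'")    // 0 : no longer found inside b_z10_

rename (*_R) (*_)          // restore the original names
ds
```

One thing to check when running the loop from #18 between the two rename commands: the endpoints of the varlist there (a_z15_ and dpb1_229_) end in an underscore, so they would presumably need to be written as a_z15_R and dpb1_229_R while the suffix is in place.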
Comment
Yuki Ishikawa

Join Date: Mar 2019

Posts: 24
#22

24 Mar 2020, 15:18

Hi Andrew,

Thank you very much for your reply.
Your "add and remove" strategy worked perfectly. It is a simple, basic, and smart approach.

Many many thanks!!!
Comment
