Multiple Regression with large numbers of variables

Jed Lye

Join Date: Jun 2021

Posts: 3
#1


17 Jun 2021, 05:48

I'm attempting to perform multiple regression on large amounts of molecular data to model the association between age and the expression of transcript isoforms during infection. My only issue is with writing the command: I don't want to type the names of 30,000 important isoforms manually at the command line. I'm new to Stata and am running it on Linux with no GUI. How do I set a vector/variable/argument that specifies column 2 to column 30,001 as the variables? I'm using Stata 15.0.

Thanks in advance
Jed

Last edited by Jed Lye; 17 Jun 2021, 05:50.
Rich Goldstein

Join Date: Mar 2014

Posts: 4463
#2

17 Jun 2021, 05:55

If the variables are really contiguous in the dataset, you can just put a hyphen between the names of the first and last variables; see

Code:

help varlist
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#3

17 Jun 2021, 05:59

-help varlist- describes the various ways in which Stata's syntax lets you refer compactly to large numbers of variables, which would be relevant if you intend your 30,000 variables as predictors. Note, however, that a regression command can't have more than 10,998 "RHS" (right-hand side) variables in Stata (see -help limits-). If you instead want to run a large number of regression commands with differing response variables, that would require a loop and, presumably, some mechanism to collect the various results. If the latter describes your situation, more information and some illustrative example data that give a sense of the structure of your data set would help people here to help you.
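
For the loop-and-collect case, a minimal sketch might look like the following. It is an illustration only: the shipped auto dataset stands in for the expression data, weight stands in for age, and the posted variable names (response, b, se, p) are hypothetical choices, not anything from the original data.

```stata
* Stand-in data: the auto dataset plays the role of the expression matrix.
sysuse auto, clear

* Temporary results file with one row per response variable.
* The names response, b, se and p are illustrative assumptions.
tempname results
tempfile out
postfile `results' str32 response double(b se p) using `out'

* Treat every variable from mpg through gear_ratio as a separate response,
* with weight as the single predictor (standing in for age).
foreach y of varlist mpg-gear_ratio {
    * The range includes the stand-in predictor itself, so skip it.
    if "`y'" == "weight" continue
    quietly regress `y' weight
    * Store the slope, its standard error, and a two-sided p-value.
    post `results' ("`y'") (_b[weight]) (_se[weight]) ///
        (2 * ttail(e(df_r), abs(_b[weight] / _se[weight])))
}
postclose `results'

* Load and inspect the collected results.
use `out', clear
list, clean
```

With real data, the range in the foreach line would be replaced by the first and last response variables, and the results could be saved to a permanent file rather than a tempfile.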

Last edited by Mike Lacy; 17 Jun 2021, 06:00. Reason: Crossed with Rich's comment.

Jed Lye

Join Date: Jun 2021
Posts: 3

17 Jun 2021, 06:13

Originally posted by Mike Lacy

-help varlist- describes the various ways in which Stata's syntax lets you refer compactly to large numbers of variables, which would be relevant if you intend your 30,000 variables as predictors. Note, however, that a regression command can't have more than 10,998 "RHS" (right-hand side) variables in Stata (see -help limits-). If you instead want to run a large number of regression commands with differing response variables, that would require a loop and, presumably, some mechanism to collect the various results. If the latter describes your situation, more information and some illustrative example data that give a sense of the structure of your data set would help people here to help you.

Thanks for the quick responses. Unfortunately, the variables are not named in a contiguous manner. If I run age as the independent variable, I am under the impression that I can run 32,767 variables with 2bn data points. Is this correct?

Here is a very small example of a similar data set with the same layout/formatting.

Sample	JUN	IGHV1-24	TLK1	IGLV3-19	IGHV4-34	SLC44A1	Age	sex
Sample_201-1382	1.493243	2.514488	0.318296	7.972933	13.23432	0.5	21	m
Sample_198-1379	0.653225	6.152193	0.476973	2.778084	3.610356	0.6	22	m
Sample_190-1371	6.647803	52.63343	0.885215	252.117	220.6308	0.7	23	m
Sample_195-1376	3.332472	7.56201	0.485464	7.715495	25.65717	0.8	24	m
Sample_71-1231	1.187104	1.705307	0.6512	2.723539	8.940849	0.9	25	m
Sample_203-1384	3.154672	25.18995	0.561193	14.58674	33.3969	1	26	m
Sample_193-1374	3.384056	2.009544	0.55489	7.182655	84.13724	1.1	27	m
Sample_194-1375	9.831789	7.127754	0.648386	273.4953	293.514	1.2	28	m
Sample_78-1238	1.477036	0.267402	0.434809	0.507142	8.238429	1.3	29	m
Sample_202-1383	6.39594	20.62537	0.816014	43.33996	31.41144	1.4	30	m
Sample_80-1240	2.624942	5.586151	0.524026	11.63781	479.5456	1.5	31	m
Sample_79-1239	3.486547	9.192847	0.701292	6.515076	20.86911	1.6	32	m
Sample_76-1236	4.699618	10.02009	0.637157	27.69676	65.90197	1.7	33	m
Sample_85-1245	2.119792	40.7076	0.653354	55.10773	153.0947	1.8	34	m
Sample_88-1248	1.43349	0.46373	0.350934	0.732907	1.984322	1.3	35	f
Sample_81-1241	1.584036	1.610277	0.561111	2.515037	4.572015	1.4	36	f
Sample_63-1223	5.260055	0.10453	1.587185	1.982468	0.644096	1.5	37	f
Sample_90-1250	4.169256	33.99086	0.673924	13.3082	13.22258	1.6	38	f
Sample_89-1249	1.260589	0.215373	0.372151	2.246567	1.216502	1.7	39	f
Sample_94-1254	6.398898	8.665698	0.951546	49.45157	59.11758	1.8	40	f


Joro Kolev

Join Date: Aug 2018
Posts: 3050

17 Jun 2021, 06:41

You need to follow up on the advice that you are getting and read the references that you were supplied with, otherwise it is not going to work unless you just give us access to the server and we do your job instead of you.

If you read the reference you were given on varlist, you will see that there are at least two ways to refer to many variables. One is by name abbreviation: say you have age, age2, age3, age68, you can do

Code:

reg y age*

or alternatively you can refer to variables that are in a consecutive block, that is, next to one another in the dataset. Like this:

Code:

. sysuse auto
(1978 Automobile Data)

. reg price mpg-fore

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(10, 58)       =      8.66
       Model |   345416162        10  34541616.2   Prob > F        =    0.0000
    Residual |   231380797        58  3989324.09   R-squared       =    0.5989
-------------+----------------------------------   Adj R-squared   =    0.5297
       Total |   576796959        68  8482308.22   Root MSE        =    1997.3

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -21.80518    77.3599    -0.28   0.779    -176.6578    133.0475
       rep78 |   184.7935   331.7921     0.56   0.580    -479.3606    848.9476
    headroom |  -635.4921   383.0243    -1.66   0.102    -1402.198    131.2142
       trunk |   71.49929   95.05012     0.75   0.455    -118.7642    261.7628
      weight |   4.521161   1.411926     3.20   0.002     1.694884    7.347438
      length |  -76.49101   40.40303    -1.89   0.063    -157.3665     4.38444
        turn |  -114.2777   123.5374    -0.93   0.359    -361.5646    133.0092
displacement |   11.54012   8.378315     1.38   0.174    -5.230896    28.31115
  gear_ratio |  -318.6479    1124.34    -0.28   0.778    -2569.259    1931.964
     foreign |   3334.848   957.2253     3.48   0.001     1418.754    5250.943
       _cons |   9789.494   6710.193     1.46   0.150    -3642.416     23221.4
------------------------------------------------------------------------------
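
To check what a hyphenated range actually expands to before using it, one option is to store it in a local macro. This is a sketch on the same auto data; the macro names block and k are arbitrary choices for illustration.

```stata
sysuse auto, clear

* Expand the range into the full list of variable names it covers,
* in dataset order, and show it.
unab block : mpg-foreign
display "`block'"

* Count how many variables the range picked up.
local k : word count `block'
display `k'

* The macro can then stand in for the typed-out list.
regress price `block'
```

The range follows the order of the variables in the dataset, not their names, so it works even when the names share no common pattern. Note that Stata variable names cannot contain hyphens, so a column header such as IGHV1-24 will not survive import unchanged; running describe after import shows the names Stata actually assigned.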


Jed Lye

Join Date: Jun 2021

Posts: 3
#6

17 Jun 2021, 07:02

On the contrary, you have actually provided the answer in very simple terms without me needing to read the references. That is sort of the point of these wonderful forums. I can continue to work whilst democratizing a problem using electronic links to tacit human knowledge and understanding. Now I know exactly how to solve my problem, for which I thank you. But understand; this is by far the fastest way of problem solving. Only when we don't get explanations through phrasing and iterating our questions do we go away and *read the manual*.

Warm regards,
J
Mesut Ozil

Join Date: May 2019

Posts: 7
#7

19 Nov 2022, 01:13

Hi, I have a similar question regarding multiple regression in Stata. I am using Stata/SE 10.1. I would like to create a model to predict mhaq_score1 (my dependent variable - DV), with several independent variables (IV), namely age, bmi, and tcm_ever.

All of these variables are continuous, except for tcm_ever, which is a dichotomous categorical variable.

My command "regress mhaq_score1 age bmi i.tcm_ever" yields the error message "i: operator invalid". How should I code the command for the categorical IV "tcm_ever"?

Please assume that I have already checked for independence of residuals, linear relationship between DV and continuous IVs, homoscedasticity etc.

Thank you
Rich Goldstein

Join Date: Mar 2014

Posts: 4463
#8

19 Nov 2022, 06:11

what version of Stata are you using?
Mesut Ozil

Join Date: May 2019

Posts: 7
#9

19 Nov 2022, 06:15

I am using Stata/SE 10.1
Rich Goldstein

Join Date: Mar 2014

Posts: 4463
#10

19 Nov 2022, 06:38

Factor-variable notation had not yet been introduced in Stata 10.1 (it arrived in Stata 11), so you need to precede your command with "xi: "; see

Code:

h xi
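
Applied to the command quoted in #7, that would look roughly like this. It is a sketch that assumes the poster's dataset is already in memory with the variable names as given there.

```stata
* Stata 10 syntax: the xi prefix expands i.tcm_ever into indicator
* variables (created with an _I prefix) before regress runs.
xi: regress mhaq_score1 age bmi i.tcm_ever

* If tcm_ever is already coded 0/1, it can also be entered directly,
* with no prefix and no i. operator.
regress mhaq_score1 age bmi tcm_ever
```

The second form only gives the same fit when the variable really is a 0/1 indicator; for other codings the xi: version is the safer choice.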
