  • Multiple Regression with large numbers of variables

    I'm attempting to perform multiple regression on a large amount of molecular data to model the association between age and the expression of transcript isoforms during infection. My only issue is with writing the command: I don't want to type the names of 30,000 isoforms by hand. I'm new to Stata and am running it on Linux with no GUI. How do I set a vector/variable/argument that specifies columns 2 to 30,001 as the variables? I'm using Stata 15.0.


    Thanks in advance
    Jed
    Last edited by Jed Lye; 17 Jun 2021, 05:50.

  • #2
    If these are really contiguous, you can just put a hyphen between the names of the first and last variables; see
    Code:
    help varlist


    • #3
      -help varlist- describes the various ways in which Stata's syntax allows you to refer compactly to large numbers of variables, which would be relevant if you intend your 30,000 variables as predictors. Note, however, that a regression command can't have more than 10,998 "RHS" (right-hand-side) variables in Stata (see -help limits-). If you instead want to run a large number of regression commands with differing response variables, that would require a loop and presumably some mechanism to collect the various results. If the latter describes your situation, more information and some illustrative example data that give a sense of the structure of your data set would help people here to help you.
      Last edited by Mike Lacy; 17 Jun 2021, 06:00. Reason: Crossed with Rich's comment.
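
      If the second scenario applies (one regression per response variable, each with the same predictor), the loop and the collection mechanism could look something like the sketch below. It runs on the shipped auto data as a stand-in, not on the poster's data; the handle h, the result variables outcome, b, se and p, and the choice of price as the common predictor are all assumptions made for illustration.

      ```stata
      * Illustrative sketch on the auto data; names are hypothetical.
      sysuse auto, clear

      * Handle and temporary file that collect one row per regression
      tempname h
      tempfile results
      postfile `h' str32 outcome double(b se p) using `results'

      * One regression per response variable, same predictor each time
      foreach y of varlist mpg-length {
          quietly regress `y' price
          local t = _b[price] / _se[price]
          post `h' ("`y'") (_b[price]) (_se[price]) (2 * ttail(e(df_r), abs(`t')))
      }

      * Close the file and load the collected results
      postclose `h'
      use `results', clear
      list, clean
      ```

      With tens of thousands of response variables the same pattern should apply, although the p-values would then need some multiple-testing adjustment, which this sketch does not attempt.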


      • #4
        Thanks for the quick responses. Unfortunately, the variables are not named in a contiguous manner. If I run age as the independent variable, I am under the impression that I can use 32,767 variables with 2bn data points. Is this correct?

        Here is a very small example of a similar data set with the same layout/formatting.

        Sample JUN IGHV1-24 TLK1 IGLV3-19 IGHV4-34 SLC44A1 Age sex
        Sample_201-1382 1.493243 2.514488 0.318296 7.972933 13.23432 0.5 21 m
        Sample_198-1379 0.653225 6.152193 0.476973 2.778084 3.610356 0.6 22 m
        Sample_190-1371 6.647803 52.63343 0.885215 252.117 220.6308 0.7 23 m
        Sample_195-1376 3.332472 7.56201 0.485464 7.715495 25.65717 0.8 24 m
        Sample_71-1231 1.187104 1.705307 0.6512 2.723539 8.940849 0.9 25 m
        Sample_203-1384 3.154672 25.18995 0.561193 14.58674 33.3969 1 26 m
        Sample_193-1374 3.384056 2.009544 0.55489 7.182655 84.13724 1.1 27 m
        Sample_194-1375 9.831789 7.127754 0.648386 273.4953 293.514 1.2 28 m
        Sample_78-1238 1.477036 0.267402 0.434809 0.507142 8.238429 1.3 29 m
        Sample_202-1383 6.39594 20.62537 0.816014 43.33996 31.41144 1.4 30 m
        Sample_80-1240 2.624942 5.586151 0.524026 11.63781 479.5456 1.5 31 m
        Sample_79-1239 3.486547 9.192847 0.701292 6.515076 20.86911 1.6 32 m
        Sample_76-1236 4.699618 10.02009 0.637157 27.69676 65.90197 1.7 33 m
        Sample_85-1245 2.119792 40.7076 0.653354 55.10773 153.0947 1.8 34 m
        Sample_88-1248 1.43349 0.46373 0.350934 0.732907 1.984322 1.3 35 f
        Sample_81-1241 1.584036 1.610277 0.561111 2.515037 4.572015 1.4 36 f
        Sample_63-1223 5.260055 0.10453 1.587185 1.982468 0.644096 1.5 37 f
        Sample_90-1250 4.169256 33.99086 0.673924 13.3082 13.22258 1.6 38 f
        Sample_89-1249 1.260589 0.215373 0.372151 2.246567 1.216502 1.7 39 f
        Sample_94-1254 6.398898 8.665698 0.951546 49.45157 59.11758 1.8 40 f
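
        With a layout like this (an identifier first, the covariates last, and everything else an expression column), one way to avoid typing the names is to let Stata build the list by exclusion. The sketch below is only an illustration on the auto data: make stands in for the sample identifier, price and foreign stand in for the covariates, and the local name isoforms is an assumption.

        ```stata
        * Stand-in data: pretend make is the sample ID and price/foreign are covariates.
        sysuse auto, clear

        * List every variable except the named ones and keep the result in a local
        ds make price foreign, not
        local isoforms `r(varlist)'

        * Check how many variables were picked up before using the list
        display "`: word count `isoforms'' variables selected"

        * The local can then stand in for the typed-out names
        summarize `isoforms'
        ```

        Gene names containing a hyphen, such as IGHV1-24, are not legal Stata variable names, so they will presumably have been altered on import; the local simply holds whatever names Stata stored.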


        • #5
          You need to follow up on the advice that you are getting and read the references that you were supplied with, otherwise it is not going to work unless you just give us access to the server and we do your job instead of you.

          If you read the reference you were given on varlist, you will see that there are at least two ways to refer to many variables. One is by name abbreviation: say you have age, age2, age3, age68, you can do

          Code:
          reg y age*
          or alternatively you can refer to variables that are in a consecutive block, that is, next to one another in the dataset. Like this:

          Code:
          . sysuse auto
          (1978 Automobile Data)
          
          . reg price mpg-fore
          
                Source |       SS           df       MS      Number of obs   =        69
          -------------+----------------------------------   F(10, 58)       =      8.66
                 Model |   345416162        10  34541616.2   Prob > F        =    0.0000
              Residual |   231380797        58  3989324.09   R-squared       =    0.5989
          -------------+----------------------------------   Adj R-squared   =    0.5297
                 Total |   576796959        68  8482308.22   Root MSE        =    1997.3
          
          ------------------------------------------------------------------------------
                 price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                   mpg |  -21.80518    77.3599    -0.28   0.779    -176.6578    133.0475
                 rep78 |   184.7935   331.7921     0.56   0.580    -479.3606    848.9476
              headroom |  -635.4921   383.0243    -1.66   0.102    -1402.198    131.2142
                 trunk |   71.49929   95.05012     0.75   0.455    -118.7642    261.7628
                weight |   4.521161   1.411926     3.20   0.002     1.694884    7.347438
                length |  -76.49101   40.40303    -1.89   0.063    -157.3665     4.38444
                  turn |  -114.2777   123.5374    -0.93   0.359    -361.5646    133.0092
          displacement |   11.54012   8.378315     1.38   0.174    -5.230896    28.31115
            gear_ratio |  -318.6479    1124.34    -0.28   0.778    -2569.259    1931.964
               foreign |   3334.848   957.2253     3.48   0.001     1418.754    5250.943
                 _cons |   9789.494   6710.193     1.46   0.150    -3642.416     23221.4
          ------------------------------------------------------------------------------
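
          Note that the hyphen follows the order in which the variables are stored, not their names. A quick way to check what a range covers, and to make a block contiguous when it is not, might look like the following sketch on the same auto data (the local names xvars and xvars2 are just illustrative):

          ```stata
          sysuse auto, clear

          * Show the variables in dataset order, which is what the hyphen follows
          describe, simple

          * Expand the range into a local macro to see exactly what it covers
          unab xvars : mpg-foreign
          display "`xvars'"

          * Variables can be moved so that a wanted block becomes contiguous
          order foreign, after(mpg)
          unab xvars2 : mpg-foreign
          display "`xvars2'"
          ```

          After the order command the same range mpg-foreign expands to a different list, which is why it is worth inspecting the expansion before running the regression.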


          • #6
            On the contrary, you have actually provided the answer in very simple terms without me needing to read the references. That is sort of the point of these wonderful forums. I can continue to work whilst democratizing a problem using electronic links to tacit human knowledge and understanding. Now I know exactly how to solve my problem, for which I thank you. But understand; this is by far the fastest way of problem solving. Only when we don't get explanations through phrasing and iterating our questions do we go away and *read the manual*.

            Warm regards,
            J


            • #7
              Hi, I have a similar question regarding multiple regression in Stata. I am using Stata/SE 10.1. I would like to create a model to predict mhaq_score1 (my dependent variable - DV), with several independent variables (IV), namely age, bmi, and tcm_ever.

              All of these variables are continuous, except for tcm_ever, which is a dichotomous categorical variable.

              My command "regress mhaq_score1 age bmi i.tcm_ever" yields the error message "i: operator invalid". How should I code the command for the categorical IV "tcm_ever"?

              Please assume that I have already checked for independence of residuals, linear relationship between DV and continuous IVs, homoscedasticity etc.

              Thank you


              • #8
                What version of Stata are you using?


                • #9
                  I am using Stata/SE 10.1


                  • #10
                    Factor-variable notation was not introduced until Stata 11, so in your version you need to precede your command with "xi: "; see
                    Code:
                    h xi
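
                    Applied to the command in #7, that would presumably read -xi: regress mhaq_score1 age bmi i.tcm_ever-. Since that dataset is not available here, the sketch below shows the same pattern on the auto data, with foreign standing in for tcm_ever; the assumption that the indicator is coded 0/1 is mine, not the poster's.

                    ```stata
                    * Stand-in example on the auto data; foreign plays the role of tcm_ever
                    sysuse auto, clear
                    xi: regress price mpg weight i.foreign

                    * If the indicator is already coded 0/1, it can also be entered as is
                    regress price mpg weight foreign
                    ```

                    For a two-level variable the two commands should give the same fit; the xi: prefix matters mainly when the categorical variable has more than two levels.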
